Main Question
Our main question:
- How does the inclusion of different numbers of age groups influence the results of an analysis of right to carry laws and violence rates?
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
This case study will introduce the topic of multicollinearity. We will do so by showcasing a real-world example where multicollinearity in part resulted in historically controversial and conflicting findings about the influence of the adoption of right-to-carry (RTC) concealed handgun laws on violent crime rates in the United States.
We will focus on two articles:
This has been a controversial topic as many other articles also had conflicting results. See here for a list of studies.
The Donohue, et al. article discusses how there are many other important methodological aspects besides multicollinearity that could account for the historically conflicting results in these previous papers.
In fact, nearly every aspect of the data analysis process was different between the Donohue, et al. analysis and the Lott and Mustard analysis.
However, we will focus particularly on multicollinearity, and we will explore how it can influence linear regression analyses and result in different conclusions.
This analysis will demonstrate how methodological details can be critically influential for our overall conclusions and can have important policy-related consequences. The following articles provide the basis for the motivation:
John J. Donohue et al., Right‐to‐Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State‐Level Synthetic Control Analysis. Journal of Empirical Legal Studies, 16,2 (2019).
David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. Coase-Sandor Institute for Law & Economics Working Paper No. 41, (1996).
Here you can see the differences in the data used in the featured RTC articles:
We will perform analyses similar to those in these articles; however, we will not try to recreate them. Instead we will simplify our analysis to allow us to focus on multicollinearity.
Therefore we will use a subset of the listed explanatory variables, and they will be consistent across both analyses that we will perform, with the exception that one analysis will have 6 demographic variables, like the analysis in the Donohue, et al. article, and the other will have 36 demographic variables, like the analysis in the Lott and Mustard article.
Our main question: How does the inclusion of different numbers of age groups influence the results of an analysis of right-to-carry laws and violence rates?
Statistical Learning Objectives:
In this case study, students will learn:
1) what multicollinearity is and how it can influence linear regression coefficients
2) how to look for the presence of multicollinearity
3) the difference between multicollinearity and correlation
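As a quick preview of the first and third objectives, here is a minimal sketch using simulated data (not the case study data, and not either paper's model) of how two nearly collinear predictors inflate variance inflation factors (VIFs) and make individual coefficient estimates unstable:

```r
library(car) # for the vif() function

set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly identical to x1
y  <- 1 + 2 * x1 + rnorm(n)

# With both predictors, the VIFs are far above the common 5-10 cutoff,
# and the individual coefficients for x1 and x2 are unreliable
fit_both <- lm(y ~ x1 + x2)
vif(fit_both)

# Dropping the redundant predictor stabilizes the estimate near 2
fit_one <- lm(y ~ x1)
coef(fit_one)
```

Note that x1 and x2 being correlated is not by itself the issue; it is their near-linear dependence inside the same regression that makes the model unable to separate their effects.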
Data science Learning Objectives:
We will especially focus on using packages and functions from the tidyverse, such as dplyr and ggplot2. The tidyverse is a collection of R packages developed by RStudio. While some students may be familiar with older R programming approaches, these packages make data science in R especially efficient.
So what exactly is a right-to-carry law?
It is a law that specifies if and how citizens are allowed to have a firearm on their person or nearby (for example, in the citizen’s car) in public.
The Second Amendment to the United States Constitution guarantees the right to “keep and bear arms”. The amendment was ratified in 1791 as part of the Bill of Rights.
However, there are no federal laws about carrying firearms in public.
These laws are created and enforced at the state level. States vary greatly in their laws about the right to carry firearms. Some require extensive effort to obtain a permit to legally carry a firearm, while other states require very minimal effort.
According to Wikipedia about the history of right-to-carry policies in the United States:
Public perception on concealed carry vs open carry has largely flipped. In the early days of the United States, open carrying of firearms, long guns and revolvers was a common and well-accepted practice. Seeing guns carried openly was not considered to be any cause for alarm. Therefore, anyone who would carry a firearm but attempt to conceal it was considered to have something to hide, and presumed to be a criminal. For this reason, concealed carry was denounced as a detestable practice in the early days of the United States.
Concealed weapons bans were passed in Kentucky and Louisiana in 1813. (In those days open carry of weapons for self-defense was considered acceptable; concealed carry was denounced as the practice of criminals.) By 1859, Indiana, Tennessee, Virginia, Alabama, and Ohio had followed suit. By the end of the nineteenth century, similar laws were passed in places such as Texas, Florida, and Oklahoma, which protected some gun rights in their state constitutions. Before the mid 1900s, most U.S. states had passed concealed carry laws rather than banning weapons completely. Until the late 1990s, many Southern states were either “No-Issue” or “Restrictive May-Issue”. Since then, these states have largely enacted “Shall-Issue” licensing laws, with numerous states legalizing “Unrestricted concealed carry”.
See here for more information.
Here are the general categories of Right to Carry Laws:
You can see that none of the fifty states currently have no-issue laws, meaning that all states allow the right to carry firearms in at least some way; however, the level of restriction differs dramatically from one state to another.
Here you can see how these laws have changed over time around the country:
There is variation from state to state even within the same general category:
For example, here are the current carry laws in Idaho, which is considered an “Unrestricted - no permit required” state:
Idaho permits the open carrying of firearms.
Idaho law permits both residents and non-residents who are at least 18 years old to carry concealed weapons, without a carry license, outside the limits of or confines of any city, provided the person is not otherwise disqualified from being issued a license to carry.
A person may also carry concealed weapons on or about his or her person, without a license, in the person’s own place of abode or fixed place of business, on property in which the person has any ownership or leasehold interest, or on private property where the person has permission to carry from any person who has an ownership or leasehold interest in that property.
State law also allows any resident of Idaho or a current member of the armed forces of the United States to carry a concealed handgun without a license to carry, provided the person is over 18 years old and not disqualified from being issued a license to carry concealed weapons under state law. An amendment to state law that takes effect on July 1, 2020 changes the reference in the above law from “a resident of Idaho” to “any citizen of the United States.”
And here are the current carry laws in Arizona, which is also considered an “Unrestricted - no permit required” state:
Arizona respects the right of law abiding citizens to openly carry a handgun.
Any person 21 years of age or older, who is not a prohibited possessor, may carry a weapon openly or concealed without the need for a license. Any person carrying without a license must acknowledge and comply with the demands of a law enforcement officer when asked if he/she is carrying a concealed deadly weapon, if the officer has initiated an “investigation” such as a traffic stop.
Notice that citizens in Idaho only need to be 18 to carry a firearm, whereas they must be 21 in Arizona.
In contrast here is an example of current carry laws in Maryland which is considered a “Rights Restricted-Very Limited Issue” state:
Carrying and Transportation in Vehicles It is unlawful for any person without a permit to wear or carry a handgun, openly or concealed, upon or about his person. It is also unlawful for any person to knowingly transport a handgun in any vehicle traveling on public roads, highways, waterways or airways, or upon roads or parking lots generally used by the public. This does not apply to any person wearing, carrying or transporting a handgun within the confines of real estate owned or leased by him, or on which he resides, or within the confines of a business establishment owned or leased by him.
Permit To Carry Application for a permit to carry a handgun is made to the Secretary of State Police. In addition to the printed application form, the applicant should submit a notarized letter stating the reasons why he is applying for a permit.
There are some important considerations regarding this data analysis to keep in mind:
We do not use all of the data used by either the Lott and Mustard or Donohue, et al. analyses, nor do we perform the same analysis as either article. We instead perform a much simpler analysis with fewer variables, to illustrate the concept of multicollinearity and its influence on regression coefficients, not to reproduce either analysis.
Because our analysis is an oversimplification, it should not be used for determining policy changes; instead, we suggest that users consult with a specialist.
We would also like to note that it is important not to treat race as an objective measure. Despite this, it can be used to advance scientific inquiry. For more information on this topic, we have included a link to a paper on the use of race as a measure in epidemiology.
We will begin by loading the packages that we will need:
library(here)
library(readxl)
library(readr)
library(pdftools)
library(dplyr)
library(magrittr)
library(tidyr)
library(stringr)
library(car) # vif function
library(plm) # fixed effect model, linear regression
library(broom) # tidy output
library(tidyverse) # general wrangling functions
library(cowplot) # to produce plot of plots
library(GGally)
library(ggrepel)
library(scales)
library(latex2exp)
library(viridis)
library(ggcorrplot)
library(rsample)
set.seed(999)

| Package | Use |
|---|---|
| here | to easily load and save data |
| readr | to import the CSV file data |
| car | to calculate VIF values |
The first time we use a function, we will use the :: operator to indicate which package it comes from. Unless we have overlapping function names, this is not necessary, but we will include it here to be clear about where the functions we use come from.
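For example, the following two calls are equivalent when dplyr is attached; the first simply makes the package explicit:

```r
library(dplyr)

# Explicit form: the package prefix shows where filter() comes from
dplyr::filter(mtcars, cyl == 6)

# Implicit form: relies on dplyr being attached (and on no other
# attached package masking filter())
filter(mtcars, cyl == 6)
```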
Below is a table from the Donohue, et al. paper that shows the data used in both analyses, where DAW stands for Donohue, et al. and LM stands for Lott and Mustard.
We will be using a subset of these variables, which are highlighted in green:
To obtain information about age, sex, race, and overall population, we will use US Census Bureau data, just like both of the articles. The census data is available for different time spans. Here are the links for the years used in our analysis. We will use data from 1977 to 2010.
| Data | Link |
|---|---|
| years 1977 to 1979 | link |
| years 1980 to 1989 | link (county data was used for this decade) |
| years 1990 to 1999 | link |
| years 2000 to 2010 | link technical documentation |
To import the data we will use the read_csv() function of the readr package for the CSV files. In some decades there are separate files for each year; we will read these together using the base list.files() function to get the names of all of the files, and then the map() function of the purrr package to apply read_csv() to each of the file paths in the list created by list.files(). For years stored as txt files we will use read_table2(), also of the readr package. The read_table2() function, unlike read_table(), allows for any number of whitespace characters between columns, and the lines can be of different lengths.
dem_77_79 <- read_csv("docs/Demographics/Decade_1970/pe-19.csv", skip = 5)
dem_80_89 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1980/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(., skip=5))
dem_90_99 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1990/",
pattern = "*.txt",
full.names = TRUE) %>%
map(~read_table2(., skip = 14))
dem_90_99 <- dem_90_99 %>%
map_df(bind_rows)
dem_00_10_2 <- read_csv("docs/Demographics/Decade_2000/st-est00int-alldata.csv")
dem_00_10 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_2000/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(.))
head(dem_00_10)[[1]]
# A tibble: 62,244 x 21
REGION DIVISION STATE NAME SEX ORIGIN RACE AGEGRP ESTIMATESBASE20…
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 Unit… 0 0 0 0 281424600
2 0 0 0 Unit… 0 0 0 1 19176154
3 0 0 0 Unit… 0 0 0 2 20549855
4 0 0 0 Unit… 0 0 0 3 20528425
5 0 0 0 Unit… 0 0 0 4 20218782
6 0 0 0 Unit… 0 0 0 5 18962964
7 0 0 0 Unit… 0 0 0 6 19381792
8 0 0 0 Unit… 0 0 0 7 20511067
9 0 0 0 Unit… 0 0 0 8 22707390
10 0 0 0 Unit… 0 0 0 9 22442442
# … with 62,234 more rows, and 12 more variables: POPESTIMATE2000 <dbl>,
# POPESTIMATE2001 <dbl>, POPESTIMATE2002 <dbl>, POPESTIMATE2003 <dbl>,
# POPESTIMATE2004 <dbl>, POPESTIMATE2005 <dbl>, POPESTIMATE2006 <dbl>,
# POPESTIMATE2007 <dbl>, POPESTIMATE2008 <dbl>, POPESTIMATE2009 <dbl>,
# CENSUS2010POP <dbl>, POPESTIMATE2010 <dbl>
Notice that the STATE variable for the demographic data is numeric. That is because it is encoded with Federal Information Processing Standard (FIPS) state codes. Thus we also need to import data about the FIPS encoding so that we can identify which data corresponds to which state.
State FIPS codes
The following data was downloaded from the US Census Bureau.
To import the data we will use the read_xls() function of the readxl package. Since the first five lines of this Excel file contain information about the source of the data and when it was released, we need to skip these lines using the skip argument so that the data has the same number of columns in each row.
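The import step described might look like the sketch below; the file name and location are assumptions, since they are not shown here, so adjust the path to wherever the downloaded spreadsheet was saved:

```r
library(readxl)

# Skip the first five rows of source/release information so that the
# data has the same number of columns in each row
# (the file path is an assumption)
STATE_FIPS <- read_xls("docs/FIPS/state-geocodes.xls", skip = 5)
STATE_FIPS
```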
# A tibble: 64 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
7 1 1 44 Rhode Island
8 1 1 50 Vermont
9 1 2 00 Middle Atlantic Division
10 1 2 34 New Jersey
# … with 54 more rows
The following data was downloaded from the Federal Bureau of Investigation.
The read_csv() function of the readr package guesses what the class is for each variable, but sometimes it makes mistakes. It is good to specify the class for variables if you know them. We know that we want the variables about male and female counts to be numeric. We can specify that using the col_types = argument. See here and here for more information.
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv")
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = "n",
female_total_ct = "n"))
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = col_double(),
female_total_ct = col_double()))
head(ps_data)
# A tibble: 6 x 21
data_year ori pub_agency_name pub_agency_unit state_abbr division_name
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1960 AK02… Alcohol Bevera… <NA> AK Pacific
2 1960 AL00… Homewood <NA> AL East South C…
3 1960 AL01… Coffeeville <NA> AL East South C…
4 1960 AL01… Coffee <NA> AL East South C…
5 1960 AL02… Mentone <NA> AL East South C…
6 1960 AL03… Greensboro <NA> AL East South C…
# … with 15 more variables: region_name <chr>, county_name <chr>,
# agency_type_name <chr>, population_group_desc <chr>, population <dbl>,
# male_officer_ct <dbl>, male_civilian_ct <dbl>, male_total_ct <dbl>,
# female_officer_ct <lgl>, female_civilian_ct <lgl>, female_total_ct <dbl>,
# officer_ct <lgl>, civilian_ct <lgl>, total_pe_ct <lgl>,
# pe_ct_per_1000 <lgl>
The following data was downloaded from the U.S. Bureau of Labor Statistics.
There are excel files for each state. As you can see, there are many rows to skip to make sure that there are the same number of columns for each row. We can also see that the state name is located in a couple of the first rows.
We can also see that here if we just try to read in the files directly.
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(.))
head(ue_rate_data)[1][[1]]
# A tibble: 55 x 14
`Local Area Une… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Original Data V… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Series Id: LAUS… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 Not Seasonally … <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 Area: Alab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 Area Type: Stat… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 State/Region/Di… Alab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
8 Measure: unem… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
9 Years: 1977… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# … with 45 more rows, and 3 more variables: ...12 <chr>, ...13 <chr>,
# ...14 <chr>
So now we will skip the first 10 lines and also create a names tibble that contains only the cell with the state information.
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., skip = 10))
head(ue_rate_data[1])[[1]]
# A tibble: 44 x 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1977 7.5 9 7.7 7.2 6.8 8.6 8 7.8 6.7 6.3 6.3 6
2 1978 7.1 6.9 6.2 5.4 5.1 6.9 6.7 6.7 6.5 6.3 6.3 6.5
3 1979 6.7 7.5 6.9 6.6 6.4 8.4 7.7 7.8 7.1 7.2 6.9 6.7
4 1980 7.7 7.8 7.4 7.4 8.4 9.7 10.4 10.3 9.3 9.6 9.4 9
5 1981 10 10.3 9.5 9.1 9.4 11.1 10.4 10.9 10.8 11.7 11.5 11.8
6 1982 13.2 13.2 12.9 12.6 12.8 14.5 14.7 14.8 14.7 15.1 15.4 15.3
7 1983 16 16 14.5 13.7 13.3 14.6 13.9 13.8 13.2 12.8 12.1 11.8
8 1984 12.5 12.4 11.4 10.8 10.1 11.3 11.5 11.3 10.8 10.2 9.7 10.1
9 1985 10.7 10.5 9.8 8.7 8.4 9.6 9.2 8.8 8.6 8.6 8.4 8.7
10 1986 9.3 10.4 10.1 9.4 9.4 10.5 9.7 9.6 9.7 9.7 9.6 9
# … with 34 more rows, and 1 more variable: Annual <dbl>
To get the state name for each file, we will use the map() function to apply the same steps to all of the files: import only a small range of cells using the range = argument, grab the cell that has the state information based on its location within that range using c(), and then use the base unlist() function to flatten the resulting list.
ue_rate_names <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., range = "B4:B6")) %>%
map(., c(1,2)) %>%
unlist()
ue_rate_names
[1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "District of Columbia"
[10] "Florida" "Georgia" "Hawaii"
[13] "Idaho" "Illinois" "Indiana"
[16] "Iowa" "Kansas" "Kentucky"
[19] "Louisiana" "Maine" "Maryland"
[22] "Massachusetts" "Michigan" "Minnesota"
[25] "Mississippi" "Missouri" "Montana"
[28] "Nebraska" "Nevada" "New Hampshire"
[31] "New Jersey" "New Mexico" "New York"
[34] "North Carolina" "North Dakota" "Ohio"
[37] "Oklahoma" "Oregon" "Pennsylvania"
[40] "Rhode Island" "South Carolina" "South Dakota"
[43] "Tennessee" "Texas" "Utah"
[46] "Vermont" "Virginia" "Washington"
[49] "West Virginia" "Wisconsin" "Wyoming"
Extracted from Table 21 of the US Census Bureau poverty data.
# Note: importing this file has intermittently produced an "unknown or
# uninitialised column" warning of unclear origin; see
# https://community.rstudio.com/t/persistent-unknown-or-uninitialised-column-warnings/64879
# The suggested approach there is to initialize the column.
poverty_rate_data <- read_xls("docs/Poverty/hstpov21.xls", skip = 2)
head(poverty_rate_data)
# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
We can see that this will require some wrangling to make the data more usable.
Violent crime data was obtained from here. This data is a bit trickier because of spaces and slashes (/) in the column names, so the read_lines() function of the readr package works better than the read_csv() function.
crime_data <- read_lines("docs/Crime/CrimeStatebyState.csv", skip = 2, skip_empty_rows = TRUE)
head(crime_data)
[1] "Estimated crime in Alabama"
[2] "\n,,National or state crime,,,,,,,"
[3] "\n,,Violent crime,,,,,,,"
[4] "\nYear,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
[5] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[6] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
We can see that this data will also require some wrangling to make it more usable.
This data is extracted from a table in the Donohue, et al. paper. We will use the pdf_text() function of the pdftools package to import the PDF document.
if(!file.exists(here("docs", "w23510.pdf"))){
url <- "https://www.nber.org/papers/w23510.pdf"
utils::download.file(url, here("docs", "w23510.pdf"))
}
DAWpaper <- pdf_text(here("docs", "w23510.pdf"))
head(DAWpaper[1])[1] " NBER WORKING PAPER SERIES\n RIGHT-TO-CARRY LAWS AND VIOLENT CRIME:\n A COMPREHENSIVE ASSESSMENT USING PANEL DATA AND\n A STATE-LEVEL SYNTHETIC CONTROL ANALYSIS\n John J. Donohue\n Abhay Aneja\n Kyle D. Weber\n Working Paper 23510\n http://www.nber.org/papers/w23510\n NATIONAL BUREAU OF ECONOMIC RESEARCH\n 1050 Massachusetts Avenue\n Cambridge, MA 02138\n June 2017, Revised November 2018\nPreviously circulated as \"Right-to-Carry Laws and Violent Crime: A Comprehensive Assessment\nUsing Panel Data and a State-Level Synthetic Controls Analysis.\" We thank Dan Ho, Stefano\nDellaVigna, Rob Tibshirani, Trevor Hastie, StefanWager, Jeff Strnad, and participants at the\n2011 Conference of Empirical Legal Studies (CELS), 2012 American Law and Economics\nAssociation (ALEA) Annual Meeting, 2013 Canadian Law and Economics Association (CLEA)\nAnnual Meeting, 2015 NBER Summer Institute (Crime), and the Stanford Law School faculty\nworkshop for their comments and helpful suggestions. Financial support was provided by\nStanford Law School. We are indebted to Alberto Abadie, Alexis Diamond, and Jens\nHainmueller for their work developing the synthetic control algorithm and programming the Stata\nmodule used in this paper and for their helpful comments. The authors would also like to thank\nAlex Albright, Andrew Baker, Jacob Dorn, Bhargav Gopal, Crystal Huang, Mira Korb, Haksoo\nLee, Isaac Rabbani, Akshay Rao, Vikram Rao, Henrik Sachs and Sidharth Sah who provided\nexcellent research assistance, as well as Addis O’Connor and Alex Chekholko at the Research\nComputing division of Stanford’s Information Technology Services for their technical support.\nThe views expressed herein are those of the author and do not necessarily reflect the views of the\nNational Bureau of Economic Research.\nNBER working papers are circulated for discussion and comment purposes. 
They have not been\npeer-reviewed or been subject to the review by the NBER Board of Directors that accompanies\nofficial NBER publications.\n© 2017 by John J. Donohue, Abhay Aneja, and Kyle D. Weber. All rights reserved. Short\nsections of text, not to exceed two paragraphs, may be quoted without explicit permission\nprovided that full credit, including © notice, is given to the source.\n"
Again, this data will also require quite a bit of wrangling.
Let’s take a look at our state FIPS data to see if it needs any cleaning or reshaping.
# A tibble: 6 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
We only need the last two columns, but we might want to rename them. The Name variable is vague, and the variable with the FIPS code is called State\n(FIPS). To get rid of the newline in this variable name and to change the Name variable to something more informative, we will use the rename() function of the dplyr package. To use this function, we list the new name first, followed by = and then the existing variable. We can rename multiple variables at the same time by separating them with commas. We will use the select() function, also of the dplyr package, to keep just these variables, and we will filter out the rows with FIPS values of 00 with the filter() function, again also part of the dplyr package. We will specify that we want STATEFP values that are not equal to 00 by using the != operator. We will also use the compound assignment pipe operator %&lt;&gt;% of the magrittr package, which takes the data as input and reassigns the result back to it after the functions have been applied.
STATE_FIPS %<>%
dplyr::rename( STATEFP = `State\n(FIPS)`,
STATE = Name) %>%
dplyr::select(STATEFP, STATE) %>%
dplyr::filter(STATEFP != "00")
STATE_FIPS
# A tibble: 51 x 2
STATEFP STATE
<chr> <chr>
1 09 Connecticut
2 23 Maine
3 25 Massachusetts
4 33 New Hampshire
5 44 Rhode Island
6 50 Vermont
7 34 New Jersey
8 36 New York
9 42 Pennsylvania
10 17 Illinois
# … with 41 more rows
Now let’s take a look at our demographic data across the decades that we wish to study. If you have very wide data (meaning it has many columns), one way to view the data so that you can see all of the columns at the same time is to use the glimpse() function of the dplyr package.
Taking a look at the first decade of data, we can see that the Race/Sex Indicator contains two types of data, the race and the sex. This does not follow the tidy data philosophy, where each cell of a tibble should only contain one piece of information. Typically one might think of using the separate() function of the tidyr package to split this variable into two. However, one of the race values is Other races and since this also has a space, this makes separating this data more tricky.
Instead we will use the str_extract() function of the stringr package and the mutate() function of the dplyr package. The mutate() function will allow us to create new variables, and str_extract() will allow us to match specific patterns and pull out the matches. Therefore, if the Race/Sex Indicator value is Other races male and we extract patterns matching either "male" or "female" (which we can specify like this: pattern = "male|female"), then the extracted value will be male.
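A quick illustration of this pattern matching, using a few values like those in the Race/Sex Indicator column:

```r
library(stringr)

race_sex <- c("White male", "White female", "Other races male")

# The alternation "male|female" tries both alternatives at each position,
# so "female" is matched in full rather than just the "male" inside it
str_extract(race_sex, "male|female")
# [1] "male"   "female" "male"

str_extract(race_sex, "Black|White|Other")
# [1] "White" "White" "Other"
```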
First we need to rename the Race/Sex Indicator variable so that it has no spaces or special characters, which makes it easier to refer to when we use mutate() and str_extract().
We also want to rename a couple of variables to be simpler and filter the data to include only the years we are interested in, as well as remove some variables that we don't need, like the FIPS State Code. We can remove variables by using the select() function with a minus sign (-) in front of the variable we wish to remove.
Rows: 3,060
Columns: 22
$ `Year of Estimate` <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, …
$ `FIPS State Code` <chr> "01", "01", "01", "01", "01", "01", "02", "02", …
$ `State Name` <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black male", "Bla…
$ `Under 5 years` <dbl> 105856, 100613, 47403, 47079, 244, 250, 12382, 1…
$ `5 to 9 years` <dbl> 120876, 115194, 55443, 54851, 255, 251, 13888, 1…
$ `10 to 14 years` <dbl> 129091, 122352, 60427, 60065, 253, 245, 13255, 1…
$ `15 to 19 years` <dbl> 119500, 116107, 52921, 55144, 281, 254, 11179, 9…
$ `20 to 24 years` <dbl> 103665, 108513, 29948, 35165, 413, 331, 20237, 1…
$ `25 to 29 years` <dbl> 86538, 88359, 19535, 23662, 239, 302, 12538, 107…
$ `30 to 34 years` <dbl> 74452, 77595, 17196, 22021, 236, 284, 10331, 865…
$ `35 to 39 years` <dbl> 71511, 74941, 16654, 22248, 161, 279, 9548, 7510…
$ `40 to 44 years` <dbl> 75242, 78908, 17564, 24249, 127, 253, 8282, 6353…
$ `45 to 49 years` <dbl> 73874, 78589, 18186, 23028, 108, 148, 6995, 5820…
$ `50 to 54 years` <dbl> 68048, 72481, 17618, 22104, 95, 100, 5609, 4494,…
$ `55 to 59 years` <dbl> 61071, 67699, 18118, 21909, 88, 93, 4029, 2986, …
$ `60 to 64 years` <dbl> 52361, 61065, 16456, 20068, 69, 94, 2392, 1830, …
$ `65 to 69 years` <dbl> 38977, 49685, 14498, 19364, 54, 73, 1292, 965, 2…
$ `70 to 74 years` <dbl> 26767, 37227, 9541, 12509, 70, 66, 602, 496, 8, …
$ `75 to 79 years` <dbl> 17504, 27163, 6030, 8291, 31, 52, 326, 305, 1, 5…
$ `80 to 84 years` <dbl> 9937, 16470, 3485, 5031, 37, 30, 211, 186, 4, 5,…
$ `85 years and over` <dbl> 5616, 10445, 2448, 4035, 76, 29, 143, 126, 19, 4…
dem_77_79 <- dem_77_79 %>%
rename("race_sex" =`Race/Sex Indicator`) %>%
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))%>%
select(-`FIPS State Code`, -`race_sex`) %>%
rename("YEAR" = `Year of Estimate`,
"STATE" = `State Name`) %>%
filter(YEAR %in% 1977:1979)
glimpse(dem_77_79)
Rows: 918
Columns: 22
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ `Under 5 years` <dbl> 98814, 94595, 46201, 45784, 590, 621, 14316, 1353…
$ `5 to 9 years` <dbl> 113365, 107395, 50097, 49329, 672, 660, 14621, 13…
$ `10 to 14 years` <dbl> 123107, 116182, 54925, 53955, 677, 653, 14795, 13…
$ `15 to 19 years` <dbl> 135343, 130433, 58468, 59926, 674, 605, 15207, 13…
$ `20 to 24 years` <dbl> 126053, 125352, 43898, 51433, 722, 773, 20106, 16…
$ `25 to 29 years` <dbl> 111547, 112471, 31014, 36648, 638, 835, 20444, 18…
$ `30 to 34 years` <dbl> 100674, 101543, 22528, 26694, 571, 766, 17514, 15…
$ `35 to 39 years` <dbl> 81038, 83369, 17473, 22213, 498, 586, 13098, 1069…
$ `40 to 44 years` <dbl> 75042, 77793, 16446, 22146, 356, 479, 10067, 7935…
$ `45 to 49 years` <dbl> 76296, 79753, 16578, 22576, 295, 432, 8460, 6848,…
$ `50 to 54 years` <dbl> 74844, 81079, 17117, 23028, 206, 326, 7268, 5914,…
$ `55 to 59 years` <dbl> 67785, 75905, 16437, 21435, 166, 213, 5398, 4485,…
$ `60 to 64 years` <dbl> 58853, 69406, 16276, 21075, 145, 174, 3349, 2708,…
$ `65 to 69 years` <dbl> 48848, 62430, 15837, 21126, 107, 173, 1714, 1468,…
$ `70 to 74 years` <dbl> 34475, 50075, 11450, 16028, 90, 138, 915, 928, 22…
$ `75 to 79 years` <dbl> 20977, 34027, 7601, 10825, 53, 106, 500, 493, 10,…
$ `80 to 84 years` <dbl> 10831, 21483, 3896, 6272, 25, 49, 237, 268, 4, 7,…
$ `85 years and over` <dbl> 6683, 15729, 2667, 5426, 33, 41, 153, 211, 11, 6,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
That’s looking pretty good! We also want to take all of the age group variables and make one variable containing the age group name and one containing the population count for that age group. To do this we will use the pivot_longer() function of the tidyr package. To use this function, we need to use the cols argument to indicate which columns we want to pivot. We also name the new variables we will create with the names_to and values_to arguments: names_to will be the name of the variable that identifies each age group, and values_to will be the name of the variable that contains the corresponding population values.
dem_77_79 <- dem_77_79 %>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP")
glimpse(dem_77_79)
Rows: 16,524
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ SEX <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
$ RACE <chr> "White", "White", "White", "White", "White", "White", "Whit…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 1…
$ SUB_POP <dbl> 98814, 113365, 123107, 135343, 126053, 111547, 100674, 8103…
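If pivot_longer() is new to you, here is a minimal, self-contained sketch (with a made-up two-state tibble, not the real census data) of what the reshaping does:

```r
library(tidyr)

# A tiny wide tibble: one row per state, one column per age group
toy <- tibble::tibble(
  STATE = c("Alabama", "Alaska"),
  `Under 5 years` = c(100, 10),
  `5 to 9 years`  = c(200, 20))

# Each state now gets one row per age group (4 rows total)
toy_long <- pivot_longer(toy,
                         cols = contains("years"),
                         names_to = "AGE_GROUP",
                         values_to = "SUB_POP")
```

The wide columns become rows, so each STATE appears once per age group.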
We also want to get data about the total population for the state for each year.
To do so we can sum all the values of the SUB_POP variable that we just created, using the group_by() and summarize() functions of the dplyr package. The group_by() function specifies how we want to calculate our sum: here, for each year and each state individually. Thus, all the values that share the same STATE and YEAR will be summed together, rather than summing over all of the values in the SUB_POP variable at once. The .groups argument allows us to remove the grouping after we perform the calculation with summarize().
pop_77_79 <- dem_77_79 %>%
group_by(YEAR, STATE) %>%
summarize("TOT_POP" = sum(SUB_POP), .groups = "drop")
pop_77_79
# A tibble: 153 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
7 1977 Connecticut 3088745
8 1977 Delaware 594815
9 1977 District of Columbia 681766
10 1977 Florida 8888806
# … with 143 more rows
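To see why the grouping matters, consider a minimal made-up example: without group_by(), sum() would collapse the whole column to a single grand total.

```r
library(dplyr)

toy <- tibble(YEAR    = c(1977, 1977, 1978),
              STATE   = c("Alabama", "Alabama", "Alabama"),
              SUB_POP = c(10, 20, 30))

# One total per YEAR/STATE combination rather than one grand total of 60
toy_totals <- toy %>%
  group_by(YEAR, STATE) %>%
  summarize(TOT_POP = sum(SUB_POP), .groups = "drop")
```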
Now we will add the total population value to the demographic tibble using the left_join() function of the dplyr package. It is important that we specify how the join should be done: rows are matched where their YEAR and STATE values are equal. This keeps the dem_77_79 variables on the left and appends the pop_77_79 data on the right.
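A self-contained sketch of this join, with toy data standing in for dem_77_79 and pop_77_79, illustrates the matching behavior:

```r
library(dplyr)

dem <- tibble(YEAR = c(1977, 1977), STATE = "Alabama",
              AGE_GROUP = c("Under 5 years", "5 to 9 years"),
              SUB_POP = c(98814, 113365))
pop <- tibble(YEAR = 1977, STATE = "Alabama", TOT_POP = 3782571)

# Every row of dem is kept; TOT_POP is repeated for each matching row
dem_joined <- left_join(dem, pop, by = c("YEAR", "STATE"))
```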
# A tibble: 16,524 x 7
YEAR STATE SEX RACE AGE_GROUP SUB_POP TOT_POP
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1977 Alabama male White Under 5 years 98814 3782571
2 1977 Alabama male White 5 to 9 years 113365 3782571
3 1977 Alabama male White 10 to 14 years 123107 3782571
4 1977 Alabama male White 15 to 19 years 135343 3782571
5 1977 Alabama male White 20 to 24 years 126053 3782571
6 1977 Alabama male White 25 to 29 years 111547 3782571
7 1977 Alabama male White 30 to 34 years 100674 3782571
8 1977 Alabama male White 35 to 39 years 81038 3782571
9 1977 Alabama male White 40 to 44 years 75042 3782571
10 1977 Alabama male White 45 to 49 years 76296 3782571
# … with 16,514 more rows
We will also calculate the percentage that each group makes up of the total population, by dividing SUB_POP by TOT_POP and multiplying by 100 using the mutate() function. We will also remove the two count variables, since we no longer need them.
dem_77_79 <- dem_77_79 %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
select(-SUB_POP, -TOT_POP)
dem_77_79
# A tibble: 16,524 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama male White Under 5 years 2.61
2 1977 Alabama male White 5 to 9 years 3.00
3 1977 Alabama male White 10 to 14 years 3.25
4 1977 Alabama male White 15 to 19 years 3.58
5 1977 Alabama male White 20 to 24 years 3.33
6 1977 Alabama male White 25 to 29 years 2.95
7 1977 Alabama male White 30 to 34 years 2.66
8 1977 Alabama male White 35 to 39 years 2.14
9 1977 Alabama male White 40 to 44 years 1.98
10 1977 Alabama male White 45 to 49 years 2.02
# … with 16,514 more rows
For this decade, each year is stored as a separate tibble and the years are combined together as a list.
[1] "list"
So the first thing we need to do is combine the tibbles of the list together. We can do that using the bind_rows() function of dplyr, which appends the data together by matching columns with the same name across the different tibbles. We will use the map_df() function of the purrr package to apply this across each tibble in our list.
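As a self-contained sketch of this pattern, with a made-up list of two small tibbles standing in for the per-year list:

```r
library(dplyr)
library(purrr)

tib_list <- list(
  tibble(YEAR = 1980, STATE = "Alabama", X = 1),
  tibble(YEAR = 1981, STATE = "Alabama", X = 2))

# map_df() applies a function to each list element and row-binds the results
combined <- map_df(tib_list, bind_rows)
```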
Rows: 188,460
Columns: 21
$ `Year of Estimate` <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black ma…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
Great! Now our data is all together.
Now we will wrangle the data similarly to the previous decade.
dem_80_89 <- dem_80_89 %>%
rename("race_sex" =`Race/Sex Indicator`) %>%
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))%>%
select( -`race_sex`) %>%
rename("YEAR" = `Year of Estimate`)
glimpse(dem_80_89)
Rows: 188,460
Columns: 22
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
$ SEX <chr> "male", "female", "male", "female", "ma…
$ RACE <chr> "White", "White", "Black", "Black", "Ot…
Notice that this time the state information is given as a numeric FIPS code combining state and county. We want only the first two characters, as the remaining ones indicate the county. We can use the str_sub() function of the stringr package for this, specifying that we want to start at the first position and end at the second. Just as we did before applying str_extract(), we first rename this variable so that it has a name that is easier to work with.
dem_80_89 %<>%
rename("STATEFP_temp" = "FIPS State and County Codes") %>%
mutate(STATEFP = str_sub(STATEFP_temp, start = 1, end = 2)) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
glimpse(dem_80_89)
Rows: 188,460
Columns: 23
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1…
$ STATEFP_temp <chr> "01001", "01001", "01001", "01001", "01001", "010…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 672, 645, 3…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, 740, 680, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614, 644, 670…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841, 711, 762…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, 516, 601, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, 414, 469, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, 303, 352, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202, 224, 260,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, 206, 288, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, 219, 236, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 203, 261, 7…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 178, 219, 8…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 171, 209, 8…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 170, 232, 6…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 164, 182, 4,…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 129, 3, 6, …
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67, 1, 2, 56…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65, 1, 1, 30,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
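On its own, str_sub() simply slices a string by position, here pulling the two-digit state portion out of a combined state-and-county FIPS code:

```r
library(stringr)

# "01001" = state "01" (Alabama) followed by county "001"
state_part <- str_sub("01001", start = 1, end = 2)
# [1] "01"
```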
Unlike the previous decade, this data is reported at the county level, so after pivoting we also group by AGE_GROUP, SEX, and RACE (in addition to YEAR and STATE) and sum across counties to get state-level counts for each subgroup.
dem_80_89 %<>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP_temp") %>%
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups="drop")
dem_80_89
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 50108
2 1980 Alabama 10 to 14 years female Other 805
3 1980 Alabama 10 to 14 years female White 109066
4 1980 Alabama 10 to 14 years male Black 50768
5 1980 Alabama 10 to 14 years male Other 826
6 1980 Alabama 10 to 14 years male White 115988
7 1980 Alabama 15 to 19 years female Black 58428
8 1980 Alabama 15 to 19 years female Other 743
9 1980 Alabama 15 to 19 years female White 126783
10 1980 Alabama 15 to 19 years male Black 56808
# … with 55,070 more rows
pop_80_89 <- dem_80_89 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
dem_80_89 <- dem_80_89 %>%
left_join(pop_80_89, by = c("YEAR","STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
dplyr::select(-SUB_POP, -TOT_POP)
dem_80_89
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 1.28
2 1980 Alabama 10 to 14 years female Other 0.0206
3 1980 Alabama 10 to 14 years female White 2.80
4 1980 Alabama 10 to 14 years male Black 1.30
5 1980 Alabama 10 to 14 years male Other 0.0212
6 1980 Alabama 10 to 14 years male White 2.97
7 1980 Alabama 15 to 19 years female Black 1.50
8 1980 Alabama 15 to 19 years female Other 0.0191
9 1980 Alabama 15 to 19 years female White 3.25
10 1980 Alabama 15 to 19 years male Black 1.46
# … with 55,070 more rows
Rows: 43,870
Columns: 19
$ Year <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 19…
$ e <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Male <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605, …
$ Female <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316, …
$ Male_1 <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 10…
$ Female_1 <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 97…
$ Male_2 <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179,…
$ Female_2 <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 202…
$ Male_3 <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, 1…
$ Female_3 <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, 1…
$ Male_4 <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, 1…
$ Female_4 <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, 1…
$ Male_5 <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32, …
$ Female_5 <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29, …
$ Male_6 <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4,…
$ Female_6 <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6,…
$ Male_7 <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 11…
$ Female_7 <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3, …
colnames(dem_90_99) <- c("YEAR",
"STATEFP",
"Age",
"NH_W_M",
"NH_W_F",
"NH_B_M",
"NH_B_F",
"NH_AIAN_M",
"NH_AIAN_F",
"NH_API_M",
"NH_API_F",
"H_W_M",
"H_W_F",
"H_B_M",
"H_B_F",
"H_AIAN_M",
"H_AIAN_F",
"H_API_M",
"H_API_F")
dim(dem_90_99)
[1] 43870 19
dem_90_99 <- dem_90_99 %>%
mutate(W_M = NH_W_M + H_W_M,
W_F = NH_W_F + H_W_F,
B_M = NH_B_M + H_B_M,
B_F = NH_B_F + H_B_F,
AIAN_M = NH_AIAN_M + H_AIAN_M,
AIAN_F = NH_AIAN_F + H_AIAN_F,
API_M = NH_API_M + H_API_M,
API_F = NH_API_F + H_API_F,
n_na = rowSums(is.na(.))) %>%
dplyr::select(-starts_with("NH_"), -starts_with("H_"))
dem_90_99 %>%
group_by(n_na) %>%
tally()
# A tibble: 2 x 2
n_na n
<dbl> <int>
1 0 43860
2 19 10
empty_rows_na <- dem_90_99 %>%
group_by(n_na) %>%
tally() %>%
filter(n_na != 0) %>%
pull(n_na)
dem_90_99 <- dem_90_99 %>%
filter(n_na != empty_rows_na) %>%
dplyr::select(-n_na)
sapply(dem_90_99, class)
YEAR STATEFP Age W_M W_F B_M
"numeric" "character" "numeric" "numeric" "numeric" "numeric"
B_F AIAN_M AIAN_F API_M API_F
"numeric" "numeric" "numeric" "numeric" "numeric"
10 to 14 years 15 to 19 years 20 to 24 years 25 to 29 years
3060 3060 3060 3060
30 to 34 years 35 to 39 years 40 to 44 years 45 to 49 years
3060 3060 3060 3060
5 to 9 years 50 to 54 years 55 to 59 years 60 to 64 years
3060 3060 3060 3060
65 to 69 years 70 to 74 years 75 to 79 years 80 to 84 years
3060 3060 3060 3060
85 years and over Under 5 years
3060 3060
dem_90_99 <- dem_90_99 %>%
mutate(AGE_GROUP = cut(Age,
breaks = seq(0,90, by=5),
right = FALSE,
labels = c("Under 5 years",
"5 to 9 years",
"10 to 14 years",
"15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years",
"40 to 44 years",
"45 to 49 years",
"50 to 54 years",
"55 to 59 years",
"60 to 64 years",
"65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over")
)) %>%
dplyr::select(-Age) %>%
mutate(AGE_GROUP = as.character(AGE_GROUP))
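A small illustration of how cut() bins single ages, using just four toy bins; with right = FALSE the intervals are closed on the left, i.e. [0,5), [5,10), and so on:

```r
ages <- c(0, 4, 5, 14, 17)
binned <- cut(ages,
              breaks = c(0, 5, 10, 15, 20),
              right = FALSE,
              labels = c("Under 5 years", "5 to 9 years",
                         "10 to 14 years", "15 to 19 years"))
as.character(binned)
# [1] "Under 5 years"  "Under 5 years"  "5 to 9 years"   "10 to 14 years" "15 to 19 years"
```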
sapply(dem_90_99, class)
YEAR STATEFP W_M W_F B_M B_F
"numeric" "character" "numeric" "numeric" "numeric" "numeric"
AIAN_M AIAN_F API_M API_F AGE_GROUP
"numeric" "numeric" "numeric" "numeric" "character"
dem_90_99 <- dem_90_99 %>%
group_by(YEAR, STATEFP, AGE_GROUP) %>%
summarise_at(vars(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")), sum) %>%
ungroup() %>%
pivot_longer(cols = c(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")),
names_to = "RACE",
values_to = "SUB_POP")
dem_90_99 <- dem_90_99 %>%
mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
TRUE ~ "Female"),
RACE = case_when(str_detect(RACE, "W_") ~ "White",
str_detect(RACE, "B_") ~ "Black",
TRUE ~ "Other")) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
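The decoding of the compact column names into SEX and RACE can be seen in isolation with a few example codes:

```r
library(dplyr)
library(stringr)

codes <- c("W_M", "B_F", "API_M")

SEX  <- case_when(str_detect(codes, "_M") ~ "Male",
                  TRUE ~ "Female")
RACE <- case_when(str_detect(codes, "W_") ~ "White",
                  str_detect(codes, "B_") ~ "Black",
                  TRUE ~ "Other")
# SEX:  "Male"  "Female" "Male"
# RACE: "White" "Black"  "Other"
```

case_when() evaluates the conditions in order and the final TRUE branch acts as a catch-all.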
pop_90_99 <- dem_90_99 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
dem_90_99 <- dem_90_99 %>%
left_join(pop_90_99, by=c("YEAR", "STATE")) %>%
mutate(PERC_SUB_POP = SUB_POP*100/TOT_POP) %>%
dplyr::select(-SUB_POP, -TOT_POP)
Rows: 62,244
Columns: 21
$ REGION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIVISION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ STATE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ NAME <chr> "United States", "United States", "United States", …
$ SEX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ORIGIN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ RACE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ AGEGRP <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ ESTIMATESBASE2000 <dbl> 281424600, 19176154, 20549855, 20528425, 20218782, …
$ POPESTIMATE2000 <dbl> 282162411, 19178293, 20463852, 20637696, 20294955, …
$ POPESTIMATE2001 <dbl> 284968955, 19298217, 20173362, 20978678, 20456284, …
$ POPESTIMATE2002 <dbl> 287625193, 19429192, 19872417, 21261421, 20610370, …
$ POPESTIMATE2003 <dbl> 290107933, 19592446, 19620851, 21415353, 20797166, …
$ POPESTIMATE2004 <dbl> 292805298, 19785885, 19454237, 21411680, 21102552, …
$ POPESTIMATE2005 <dbl> 295516599, 19917400, 19389067, 21212579, 21486214, …
$ POPESTIMATE2006 <dbl> 298379912, 19938883, 19544688, 21033138, 21807709, …
$ POPESTIMATE2007 <dbl> 301231207, 20125962, 19714611, 20841042, 22067816, …
$ POPESTIMATE2008 <dbl> 304093966, 20271127, 19929602, 20706655, 22210880, …
$ POPESTIMATE2009 <dbl> 306771529, 20244518, 20182499, 20660564, 22192810, …
$ CENSUS2010POP <dbl> 308745538, 20201362, 20348657, 20677194, 22040343, …
$ POPESTIMATE2010 <dbl> 309349689, 20200529, 20382409, 20694011, 21959087, …
dem_00_10 <- dem_00_10 %>%
dplyr::select(-ESTIMATESBASE2000,-CENSUS2010POP) %>%
filter(REGION != 0,
DIVISION != 0,
SEX != 0,
ORIGIN == 0,
RACE != 0,
AGEGRP != 0,
STATE != 0) %>%
dplyr::select(-REGION, -DIVISION, -ORIGIN, -STATE) %>%
rename("STATE" = NAME,
"AGE_GROUP" = AGEGRP) %>%
mutate(SEX = factor(SEX,
levels = 1:2,
labels = c("Male",
"Female")),
RACE = factor(RACE,
levels = 1:6,
labels = c("White",
"Black",
rep("Other",4))),
AGE_GROUP = factor(AGE_GROUP,
levels = 1:18,
labels = c("Under 5 years",
"5 to 9 years",
"10 to 14 years",
"15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years",
"40 to 44 years",
"45 to 49 years",
"50 to 54 years",
"55 to 59 years",
"60 to 64 years",
"65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"))) %>%
mutate(SEX = as.character(SEX),
RACE = as.character(RACE),
AGE_GROUP = as.character(AGE_GROUP))
colnames(dem_00_10)
 [1] "STATE"           "SEX"             "RACE"            "AGE_GROUP"
[5] "POPESTIMATE2000" "POPESTIMATE2001" "POPESTIMATE2002" "POPESTIMATE2003"
[9] "POPESTIMATE2004" "POPESTIMATE2005" "POPESTIMATE2006" "POPESTIMATE2007"
[13] "POPESTIMATE2008" "POPESTIMATE2009" "POPESTIMATE2010"
dem_00_10 <- dem_00_10 %>%
pivot_longer(cols=contains("ESTIMATE"),
names_to = "YEAR",
values_to = "SUB_POP")
dem_00_10 <- dem_00_10 %>%
mutate(YEAR = str_sub(YEAR, start=-4)) %>%
mutate(YEAR = as.numeric(YEAR))
sapply(dem_00_10, class)
STATE SEX RACE AGE_GROUP YEAR SUB_POP
"character" "character" "character" "character" "numeric" "numeric"
pop_00_10 <- dem_00_10 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
dem_00_10 %>%
left_join(pop_00_10, by=c("YEAR", "STATE")) %>%
group_by(YEAR, STATE) %>%
mutate(PERC_SUB_POP = SUB_POP*100/TOT_POP) %>%
summarise(perc_tot = sum(PERC_SUB_POP), .groups = "drop") %>%
mutate(poss_error = case_when(abs(perc_tot - 100) > 0 ~ TRUE,
TRUE ~ FALSE)) %>%
group_by(poss_error) %>%
tally()
# A tibble: 1 x 2
poss_error n
<lgl> <int>
1 FALSE 561
[1] TRUE
[1] TRUE
[1] TRUE
# A tibble: 6 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama male White Under 5 years 2.61
2 1977 Alabama male White 5 to 9 years 3.00
3 1977 Alabama male White 10 to 14 years 3.25
4 1977 Alabama male White 15 to 19 years 3.58
5 1977 Alabama male White 20 to 24 years 3.33
6 1977 Alabama male White 25 to 29 years 2.95
# A tibble: 6 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 1.28
2 1980 Alabama 10 to 14 years female Other 0.0206
3 1980 Alabama 10 to 14 years female White 2.80
4 1980 Alabama 10 to 14 years male Black 1.30
5 1980 Alabama 10 to 14 years male Other 0.0212
6 1980 Alabama 10 to 14 years male White 2.97
# A tibble: 6 x 6
YEAR AGE_GROUP RACE SEX STATE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1990 10 to 14 years White Male Alabama 2.46
2 1990 10 to 14 years White Female Alabama 2.33
3 1990 10 to 14 years Black Male Alabama 1.21
4 1990 10 to 14 years Black Female Alabama 1.20
5 1990 10 to 14 years Other Male Alabama 0.0239
6 1990 10 to 14 years Other Female Alabama 0.0235
# A tibble: 6 x 6
STATE SEX RACE AGE_GROUP YEAR PERC_SUB_POP
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Alabama Male White Under 5 years 2000 2.24
2 Alabama Male White Under 5 years 2001 2.24
3 Alabama Male White Under 5 years 2002 2.22
4 Alabama Male White Under 5 years 2003 2.21
5 Alabama Male White Under 5 years 2004 2.20
6 Alabama Male White Under 5 years 2005 2.20
[1] 18
[1] 18
[1] 18
[1] 18
dem <- bind_rows(dem_77_79,
dem_80_89,
dem_90_99,
dem_00_10)
dem %>%
filter(RACE == "Other") %>%
group_by(YEAR) %>%
tally() %>%
summarise(years_data = n())
# A tibble: 1 x 1
years_data
<int>
1 34
[1] 34
DONOHUE_AGE_GROUPS <- c("15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")
DONOHUE_RACE <- c("White",
"Black",
"Other")
DONOHUE_SEX <- c("Male")
dem_DONOHUE <- dem %>%
filter(AGE_GROUP %in% DONOHUE_AGE_GROUPS,
RACE %in% DONOHUE_RACE,
SEX %in% DONOHUE_SEX) %>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP, "20 to 39 years"=c("20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years"))) %>%
mutate(AGE_GROUP = str_replace_all(AGE_GROUP," ","_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") %>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE"=PERC_SUB_POP)
LOTT_AGE_GROUPS_NULL <- c("Under 5 years",
"5 to 9 years")
LOTT_RACE <- c("White",
"Black",
"Other")
LOTT_SEX <- c("Male",
"Female")
dem_LOTT <- dem %>%
filter(!(AGE_GROUP %in% LOTT_AGE_GROUPS_NULL),
RACE %in% LOTT_RACE,
SEX %in% LOTT_SEX) %>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP,
"10 to 19 years"=c("10 to 14 years",
"15 to 19 years"),
"20 to 29 years"=c("20 to 24 years",
"25 to 29 years"),
"30 to 39 years"=c("30 to 34 years",
"35 to 39 years"),
"40 to 49 years"=c("40 to 44 years",
"45 to 49 years"),
"50 to 64 years"=c("50 to 54 years",
"55 to 59 years",
"60 to 64 years"),
"65 years and over"=c("65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"))) %>%
mutate(AGE_GROUP = str_replace_all(AGE_GROUP," ","_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") %>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE"=PERC_SUB_POP)
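Since collapsing the census five-year bins into wider groups is central to the difference between the two specifications, here is fct_collapse() on its own with a toy factor:

```r
library(forcats)

ag <- factor(c("10 to 14 years", "15 to 19 years", "20 to 24 years"))

# Each named argument becomes a new level made up of the listed old levels
collapsed <- fct_collapse(ag,
                          "10 to 19 years" = c("10 to 14 years",
                                               "15 to 19 years"),
                          "20 to 29 years" = c("20 to 24 years"))
as.character(collapsed)
# [1] "10 to 19 years" "10 to 19 years" "20 to 29 years"
```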
dim(expand.grid(c(1:6), c(7:8), c(9:10)))[1]
[1] 24
[1] TRUE
[1] TRUE
[1] TRUE
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1980 Alabama 3899671
2 1980 Alaska 404680
3 1980 Arizona 2735840
4 1980 Arkansas 2288809
5 1980 California 23792840
6 1980 Colorado 2909545
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1990 Alabama 4048508
2 1990 Alaska 553120
3 1990 Arizona 3679056
4 1990 Arkansas 2354343
5 1990 California 29950111
6 1990 Colorado 3303862
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 2000 Alabama 4452173
2 2000 Alaska 627963
3 2000 Arizona 5160586
4 2000 Arkansas 2678588
5 2000 California 33987977
6 2000 Colorado 4326921
population_data <- bind_rows(pop_77_79,
pop_80_89,
pop_90_99,
pop_00_10)
population_data %>%
group_by(YEAR) %>%
tally() %>%
print(n = dim(.)[1])
# A tibble: 34 x 2
YEAR n
<dbl> <int>
1 1977 51
2 1978 51
3 1979 51
4 1980 51
5 1981 51
6 1982 51
7 1983 51
8 1984 51
9 1985 51
10 1986 51
11 1987 51
12 1988 51
13 1989 51
14 1990 51
15 1991 51
16 1992 51
17 1993 51
18 1994 51
19 1995 51
20 1996 51
21 1997 51
22 1998 51
23 1999 51
24 2000 51
25 2001 51
26 2002 51
27 2003 51
28 2004 51
29 2005 51
30 2006 51
31 2007 51
32 2008 51
33 2009 51
34 2010 51
[1] "data_year" "ori" "pub_agency_name"
[4] "pub_agency_unit" "state_abbr" "division_name"
[7] "region_name" "county_name" "agency_type_name"
[10] "population_group_desc" "population" "male_officer_ct"
[13] "male_civilian_ct" "male_total_ct" "female_officer_ct"
[16] "female_civilian_ct" "female_total_ct" "officer_ct"
[19] "civilian_ct" "total_pe_ct" "pe_ct_per_1000"
ps_data <- ps_data %>%
filter(data_year >= 1977,
data_year <= 2014) %>%
mutate(male_total_ct = case_when(is.na(male_total_ct) ~ 0,
TRUE ~ male_total_ct),
female_total_ct = case_when(is.na(female_total_ct) ~ 0,
TRUE ~ female_total_ct)) %>%
mutate(officer_total = male_total_ct + female_total_ct) %>%
dplyr::select(data_year,
pub_agency_name,
state_abbr,
officer_total)
ps_data <- ps_data %>%
group_by(data_year, state_abbr) %>%
summarise(officer_state_total=sum(officer_total), .groups = "drop")
ps_data %>%
group_by(state_abbr) %>%
tally() %>%
print(n = dim(.)[1])
# A tibble: 59 x 2
state_abbr n
<chr> <int>
1 AK 38
2 AL 38
3 AR 38
4 AS 38
5 AZ 38
6 CA 38
7 CO 38
8 CT 38
9 CZ 38
10 DC 38
11 DE 38
12 FL 38
13 FS 38
14 GA 38
15 GM 38
16 HI 38
17 IA 38
18 ID 38
19 IL 38
20 IN 38
21 KS 38
22 KY 38
23 LA 38
24 MA 38
25 MD 38
26 ME 38
27 MI 38
28 MN 38
29 MO 38
30 MP 38
31 MS 38
32 MT 38
33 NB 38
34 NC 38
35 ND 38
36 NH 38
37 NJ 38
38 NM 38
39 NV 38
40 NY 38
41 OH 38
42 OK 38
43 OR 38
44 OT 38
45 PA 38
46 PR 38
47 RI 38
48 SC 38
49 SD 38
50 TN 38
51 TX 38
52 UT 38
53 VA 38
54 VI 38
55 VT 38
56 WA 38
57 WI 38
58 WV 38
59 WY 38
# NB is Nebraska. The official abbreviation was changed to NE to avoid confusion
# with NB (New Brunswick) in Canada, but this dataset still uses NB.
state_of_interest_NULL <- c("AS",
"GM",
"CZ",
"FS",
"MP",
"OT",
"PR",
"VI")
state_abb_df <- data.frame(state_abbr = state.abb,
                           STATE = state.name)
print(state_abb_df)
   state_abbr                STATE
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
7 CT Connecticut
8 DE Delaware
9 FL Florida
10 GA Georgia
11 HI Hawaii
12 ID Idaho
13 IL Illinois
14 IN Indiana
15 IA Iowa
16 KS Kansas
17 KY Kentucky
18 LA Louisiana
19 ME Maine
20 MD Maryland
21 MA Massachusetts
22 MI Michigan
23 MN Minnesota
24 MS Mississippi
25 MO Missouri
26 MT Montana
27 NE Nebraska
28 NV Nevada
29 NH New Hampshire
30 NJ New Jersey
31 NM New Mexico
32 NY New York
33 NC North Carolina
34 ND North Dakota
35 OH Ohio
36 OK Oklahoma
37 OR Oregon
38 PA Pennsylvania
39 RI Rhode Island
40 SC South Carolina
41 SD South Dakota
42 TN Tennessee
43 TX Texas
44 UT Utah
45 VT Vermont
46 VA Virginia
47 WA Washington
48 WV West Virginia
49 WI Wisconsin
50 WY Wyoming
state_abb_df <- state_abb_df %>%
add_row(state_abbr="DC",
STATE="District of Columbia")
denominator_temp <- population_data %>%
dplyr::select(-VARIABLE) %>%
rename("Population_temp"=VALUE)
ps_data <- ps_data %>%
filter(!(state_abbr %in% state_of_interest_NULL)) %>%
mutate(state_abbr = case_when(state_abbr == "NB" ~ "NE",
TRUE ~ state_abbr)) %>%
left_join(state_abb_df, by = "state_abbr") %>%
dplyr::select(-state_abbr) %>%
rename(YEAR = "data_year",
VALUE = "officer_state_total") %>%
mutate(VARIABLE = "officer_state_total") %>%
left_join(denominator_temp, by=c("STATE","YEAR")) %>%
  mutate(VALUE = (VALUE*100000) / Population_temp) %>%
  # lag within each state (ordered by year) so that a state's value is not
  # lagged against a different state's value
  group_by(STATE) %>%
  arrange(YEAR, .by_group = TRUE) %>%
  mutate(VALUE = lag(VALUE)) %>%
  ungroup() %>%
  mutate(VARIABLE = "police_per_100k_lag") %>%
  dplyr::select(-Population_temp)
# A tibble: 1 x 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020 3.2 2.9 3 13.3 NA NA NA NA NA NA NA NA
# … with 1 more variable: Annual <dbl>
[1] "STATE" "Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun"
[9] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" "Annual"
STATE Year Jan Feb Mar Apr
"character" "numeric" "numeric" "numeric" "numeric" "numeric"
May Jun Jul Aug Sep Oct
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Nov Dec Annual
"numeric" "numeric" "numeric"
# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
colnames(poverty_rate_data) <- c("STATE",
"Total",
"Number",
"Number_se",
"Percent",
"Percent_se")
tail(poverty_rate_data)
# A tibble: 6 x 6
STATE Total Number Number_se Percent Percent_se
<chr> <chr> <chr> <chr> <chr> <chr>
1 Wisconsin 4724 403 57 8.5 1.1000000000…
2 Wyoming 468 49 20 10.4 4
3 Standard errors shown in this ta… <NA> <NA> <NA> <NA> <NA>
4 For information on confidentiali… <NA> <NA> <NA> <NA> <NA>
5 Footnotes are available at <www.… <NA> <NA> <NA> <NA> <NA>
6 SOURCE: U.S. Bureau of the Censu… <NA> <NA> <NA> <NA> <NA>
notes <- 4
poverty_rate_data <- poverty_rate_data[-((dim(poverty_rate_data)[1]-notes+1):dim(poverty_rate_data)[1]),]
states_eq <- 51
extra_col <- 2
rep_rows <- states_eq + extra_col
groups <- (dim(poverty_rate_data)[1])/(rep_rows)
paste(groups - (2018-1980 + 1), "extra groups")
[1] "2 extra groups"
poverty_rate_data$year_group <- rep(1:groups, each=rep_rows)
poverty_rate_data <- poverty_rate_data %>%
group_by(year_group) %>%
group_split()
head(poverty_rate_data[[1]])
# A tibble: 6 x 7
STATE Total Number Number_se Percent Percent_se year_group
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 2018 <NA> <NA> <NA> <NA> <NA> 1
2 STATE Total Number "Standard\ner… Percent "Standard\nerro… 1
3 Alabama 4877 779 "65" 16 "1.3" 1
4 Alaska 720 94 "9" 13.1 "1.2" 1
5 Arizona 7241 929 "80" 12.8000000000… "1.100000000000… 1
6 Arkans… 2912 462 "38" 15.9 "1.3" 1
poverty_rate_data <- poverty_rate_data %>%
map(~mutate(.,
row_id = row_number())) %>%
map(~filter(.,row_id != 2)) %>%
map(~dplyr::select(.,-row_id))
poverty_rate_data_names <- poverty_rate_data %>%
sapply(., "[",1,1, drop=TRUE) %>%
str_replace_all(.,"[:space:]","_")
names(poverty_rate_data) <- poverty_rate_data_names
# Recall 2 extra groups.
# footnotes available at https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-footnotes/cps-historic-footnotes.html
poverty_rate_data$`2017_(21)` <- NULL
poverty_rate_data$`2013_(19)` <- NULL
poverty_rate_data_names <- poverty_rate_data %>%
sapply(., "[",1,1, drop=TRUE) %>%
str_sub(., start = 1, end=4)
names(poverty_rate_data) <- poverty_rate_data_names
poverty_rate_data <- poverty_rate_data %>%
map_df(bind_rows, .id = "YEAR") %>%
dplyr::select(-year_group)
poverty_rate_data <- poverty_rate_data %>%
mutate(n_na = rowSums(is.na(.)))
# This shows that the missing values are systematic, stemming *solely* from the
# rows that contain no poverty data, only a label designating the year
poverty_rate_data %>%
group_by(n_na) %>%
tally()
# A tibble: 2 x 2
n_na n
<dbl> <int>
1 0 1989
2 5 39
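The per-row NA count trick can be demonstrated on a toy tibble where one row is just a year label and the rest of its columns are empty:

```r
library(dplyr)

toy <- tibble(STATE   = c("Alabama", "2017"),
              Total   = c("4877", NA),
              Percent = c("16", NA))

# is.na() returns a logical matrix; rowSums() counts the NAs per row
toy <- toy %>% mutate(n_na = rowSums(is.na(.)))
# Data rows get n_na = 0; the label-only row gets n_na = 2
```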
YEAR STATE Total Number Number_se Percent
"character" "character" "character" "character" "character" "character"
Percent_se n_na
"character" "numeric"
poverty_rate_data <- poverty_rate_data %>%
drop_na() %>%
dplyr::select(-Number,
-Number_se,
-Percent_se,
-n_na,
-Total) %>%
rename("VALUE"=Percent) %>%
mutate(VARIABLE = "Poverty_rate",
YEAR = as.numeric(YEAR),
VALUE = as.numeric(VALUE))
colnames(poverty_rate_data)
[1] "YEAR" "STATE" "VALUE" "VARIABLE"
https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm
crime_data <- read_lines("docs/Crime/CrimeStatebyState.csv", skip = 2, skip_empty_rows = TRUE)
length(crime_data)
[1] 2254
crime_data <- crime_data[-(2143:length(crime_data))]
x <- 2014-1977+1
rep_cycle <- 2 + 2 + x
rep_cycle_cut <- 2 + x
delete_rows <- c(seq(2,length(crime_data),rep_cycle),
seq(3,length(crime_data),rep_cycle))
crime_data <- crime_data[-delete_rows]
crime_data <- data.frame(cbind(crime_data, rep(1:(length(crime_data)/rep_cycle_cut),each=rep_cycle_cut)))
colnames(crime_data) <- c("String","STATE_GROUP")
crime_data <- crime_data %>%
group_by(STATE_GROUP) %>%
group_split()
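The cycle-based deletion above can be checked on a toy vector (a hypothetical file layout, not the real UCR export): when every state block repeats on a fixed cycle, `seq()` with a step equal to the cycle length indexes the same unwanted line in every block.

```r
# Toy version of the seq()-based deletion used above (illustrative only):
# each block is 1 header line, 1 junk line, then 3 data lines (cycle = 5).
toy <- c("header_A", "junk", "a1", "a2", "a3",
         "header_B", "junk", "b1", "b2", "b3")
cycle <- 5
drop_idx <- seq(2, length(toy), by = cycle)  # the 2nd line of every block
toy_clean <- toy[-drop_idx]
toy_clean
# "header_A" "a1" "a2" "a3" "header_B" "b1" "b2" "b3"
```

The same idea, applied twice with different offsets, produced `delete_rows` above.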
columns_crime_data <- 8
crime_data <- crime_data %>%
map(~mutate(.,
State = case_when(str_detect(String, "Estimated crime in ") ~ substring(String, nchar("Estimated crime in ")+1)),
row_id = row_number())) %>%
map(~fill(., State)) %>%
map(~filter(.,row_id > 2)) %>%
map(~mutate(.,
String = paste0(String, ",", State))) %>%
map(~dplyr::select(.,String)) %>%
map(~str_split_fixed(.$String,",",columns_crime_data + 1)) %>%
map(~data.frame(.)) %>%
map(~rename(.,"YEAR"=X1,
"Extra_col1"=X2,
"VC"=X3,
"Extra_col2"=X4,
"Extra_col3"=X5,
"Extra_col4"=X6,
"Extra_col5"=X7,
"Extra_col6"=X8,
"STATE"=X9)) %>%
map(~dplyr::select(.,-contains("Extra_col"))) %>%
map(~.x %>% mutate_all(~trimws(.,which = "both"))) %>%
map_df(bind_rows)
sapply(crime_data, class)
YEAR VC STATE
"character" "character" "character"
DAWpaper_p_62 <- DAWpaper[[62]]
p_62 <- DAWpaper_p_62 %>%
strsplit("\n") %>%
unlist() %>%
as.data.frame() %>%
slice(-(1:2))
apply(p_62, 1, nchar)
[1] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109
[20] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109
[39] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 63
[1] " 60"
[1] 3 4 4 4 3 4 2 3 2 4 4 3 4 3 4 4 4 4 4 4 3 2 4 4 4 4 4 4 4 2 3 3 3 3 3 4 4 4
[39] 3 3 2 3 3 4 4 4 3 4 2 3 4 4
[1] 2 3 3 3 2 3 2 2 2 3 3 2 3 2 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 2 2 3 2 3 3 3 3 3
[39] 3 3 2 3 3 3 3 3 2 3 2 3 3 3
[1] 2 2 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2
[39] 2 2 1 2 2 2 2 2 2 2 1 2 2 2
[1] 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0
[39] 0 0 1 0 0 0 0 0 1 0 1 0 0 0
.
1 Alabama 1975 1975
2 Alaska 10/1/1994 0.252 1995
3 Arizona 7/17/1994 0.460 1995
4 Arkansas 7/27/1995 0.433 1996
5 California N/A 0
6 Colorado 5/17/2003 0.627 2003
apply(p_62, 1, str_count, "\s{40,}")
1 1
2 0
3 0
4 0
5 1
6 0
p_62 <- p_62 %>%
apply(1,str_replace_all, "\\s{40,}", "|N/A|") %>%
str_replace_all("\\s{2,15}", "|") %>%
as.data.frame()
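The two-step replacement above can be tested in isolation with base R's `gsub()` on a made-up fixed-width line (the 45-space run here is an assumed empty cell, not taken from the PDF):

```r
# In the fixed-width text, runs of 40+ spaces are empty cells and runs of
# 2-15 spaces are column gaps: mark the former as N/A first, then turn
# the remaining gaps into delimiters.
line <- paste0(" Alabama  1975", strrep(" ", 45), "1975")
line <- gsub("\\s{40,}", "|N/A|", line)  # empty cell -> explicit N/A
line <- gsub("\\s{2,15}", "|", line)     # remaining gaps -> delimiters
line
# " Alabama|1975|N/A|1975"
```

Ordering matters: running the narrow-gap replacement first would break the 40+ space runs into pieces and lose the empty-cell information.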
p_62 <- sapply(p_62$., str_split, "\\|{1,}")
sapply(p_62, nchar)
$`|Alabama||1975|N/A|1975`
[1] 0 7 4 3 4
$`|Alaska||10/1/1994||0.252|||1995`
[1] 0 6 9 5 4
$`|Arizona| 7/17/1994||0.460|||1995`
[1] 0 7 10 5 4
$`|Arkansas| 7/27/1995||0.433|||1996`
[1] 0 8 10 5 4
$`|California||N/A|N/A|0`
[1] 0 10 3 3 1
$`|Colorado| 5/17/2003||0.627|||2003`
[1] 0 8 10 5 4
$`|Connecticut||1970|N/A|1970`
[1] 0 11 4 3 4
$`|Delaware||N/A|N/A|0`
[1] 0 8 3 3 1
$`District of Columbia|N/A|N/A|0`
[1] 20 3 3 1
$`|Florida| 10/1/1987||0.252|||1988`
[1] 0 7 10 5 4
$`|Georgia| 8/25/1989||0.353|||1990`
[1] 0 7 10 5 4
$`|Hawaii||N/A|N/A|0`
[1] 0 6 3 3 1
$`|Idaho||7/1/1990||0.504|||1990`
[1] 0 5 8 5 4
$`|Illinois| 1/5/2014|N/A|2014`
[1] 0 8 9 3 4
$`|Indiana| 1/15/1980||0.962|||1980`
[1] 0 7 10 5 4
$`|Iowa||1/1/2011||1.000|||2011`
[1] 0 4 8 5 4
$`|Kansas||1/1/2007||1.000|||2007`
[1] 0 6 8 5 4
$`|Kentucky| 10/1/1996||0.251|||1997`
[1] 0 8 10 5 4
$`|Louisiana|4/19/1996||0.702|||1996`
[1] 0 9 9 5 4
$`|Maine||9/19/1985||0.285|||1986`
[1] 0 5 9 5 4
$`|Maryland||N/A|N/A|0`
[1] 0 8 3 3 1
$`|Massachusetts||N/A|N/A|0`
[1] 0 13 3 3 1
$`|Michigan||7/1/2001||0.504|||2001`
[1] 0 8 8 5 4
$`|Minnesota| 5/28/2003||0.597|||2003`
[1] 0 9 10 5 4
$`|Mississippi|7/1/1990||0.504|||1990`
[1] 0 11 8 5 4
$`|Missouri| 2/26/2004||0.847|||2004`
[1] 0 8 10 5 4
$`|Montana||10/1/1991||0.252|||1992`
[1] 0 7 9 5 4
$`|Nebraska||1/1/2007||1.000|||2007`
[1] 0 8 8 5 4
$`|Nevada||10/1/1995||0.252|||1996`
[1] 0 6 9 5 4
$`|New Hampshire||1959|N/A|1959`
[1] 0 13 4 3 4
$`|New Jersey||N/A|N/A|0`
[1] 0 10 3 3 1
$`|New Mexico||1/1/2004||1.000|||2004`
[1] 0 10 8 5 4
$`|New York||N/A|N/A|0`
[1] 0 8 3 3 1
$`|North Carolina|12/1/1995||0.085|||1996`
[1] 0 14 9 5 4
$`|North Dakota| 8/1/1985||0.419|||1986`
[1] 0 12 9 5 4
$`|Ohio||4/8/2004||0.732|||2004`
[1] 0 4 8 5 4
$`|Oklahoma||1/1/1996||1.000|||1996`
[1] 0 8 8 5 4
$`|Oregon||1/1/1990||1.000|||1990`
[1] 0 6 8 5 4
$`|Pennsylvania|6/17/1989||0.542|||1989`
[1] 0 12 9 5 4
$`|Philadelphia|10/11/1995||0.225|||1996`
[1] 0 12 10 5 4
$`|Rhode Island||N/A|N/A|0`
[1] 0 12 3 3 1
$`|South Carolina|8/23/1996||0.358|||1997`
[1] 0 14 9 5 4
$`|South Dakota| 7/1/1985||0.504|||1985`
[1] 0 12 9 5 4
$`|Tennessee| 10/1/1996||0.251|||1997`
[1] 0 9 10 5 4
$`|Texas||1/1/1996||1.000|||1996`
[1] 0 5 8 5 4
$`|Utah||5/1/1995||0.671|||1995`
[1] 0 4 8 5 4
$`|Vermont||1970|N/A|1970`
[1] 0 7 4 3 4
$`|Virginia| 5/5/1995||0.660|||1995`
[1] 0 8 9 5 4
$`|Washington||1961|N/A|1961`
[1] 0 10 4 3 4
$`|West Virginia|7/7/1989||0.488|||1990`
[1] 0 13 8 5 4
$`|Wisconsin| 11/1/2011||0.167|||2012`
[1] 0 9 10 5 4
$`|Wyoming||10/1/1994||0.252|||1995`
[1] 0 7 9 5 4
p_62 <- lapply(p_62, function(x) x[nchar(x) > 0])
p_62 <- as.data.frame(do.call(rbind, p_62))
rownames(p_62)
[1] "|Alabama||1975|N/A|1975"
[2] "|Alaska||10/1/1994||0.252|||1995"
[3] "|Arizona| 7/17/1994||0.460|||1995"
[4] "|Arkansas| 7/27/1995||0.433|||1996"
[5] "|California||N/A|N/A|0"
[6] "|Colorado| 5/17/2003||0.627|||2003"
[7] "|Connecticut||1970|N/A|1970"
[8] "|Delaware||N/A|N/A|0"
[9] "District of Columbia|N/A|N/A|0"
[10] "|Florida| 10/1/1987||0.252|||1988"
[11] "|Georgia| 8/25/1989||0.353|||1990"
[12] "|Hawaii||N/A|N/A|0"
[13] "|Idaho||7/1/1990||0.504|||1990"
[14] "|Illinois| 1/5/2014|N/A|2014"
[15] "|Indiana| 1/15/1980||0.962|||1980"
[16] "|Iowa||1/1/2011||1.000|||2011"
[17] "|Kansas||1/1/2007||1.000|||2007"
[18] "|Kentucky| 10/1/1996||0.251|||1997"
[19] "|Louisiana|4/19/1996||0.702|||1996"
[20] "|Maine||9/19/1985||0.285|||1986"
[21] "|Maryland||N/A|N/A|0"
[22] "|Massachusetts||N/A|N/A|0"
[23] "|Michigan||7/1/2001||0.504|||2001"
[24] "|Minnesota| 5/28/2003||0.597|||2003"
[25] "|Mississippi|7/1/1990||0.504|||1990"
[26] "|Missouri| 2/26/2004||0.847|||2004"
[27] "|Montana||10/1/1991||0.252|||1992"
[28] "|Nebraska||1/1/2007||1.000|||2007"
[29] "|Nevada||10/1/1995||0.252|||1996"
[30] "|New Hampshire||1959|N/A|1959"
[31] "|New Jersey||N/A|N/A|0"
[32] "|New Mexico||1/1/2004||1.000|||2004"
[33] "|New York||N/A|N/A|0"
[34] "|North Carolina|12/1/1995||0.085|||1996"
[35] "|North Dakota| 8/1/1985||0.419|||1986"
[36] "|Ohio||4/8/2004||0.732|||2004"
[37] "|Oklahoma||1/1/1996||1.000|||1996"
[38] "|Oregon||1/1/1990||1.000|||1990"
[39] "|Pennsylvania|6/17/1989||0.542|||1989"
[40] "|Philadelphia|10/11/1995||0.225|||1996"
[41] "|Rhode Island||N/A|N/A|0"
[42] "|South Carolina|8/23/1996||0.358|||1997"
[43] "|South Dakota| 7/1/1985||0.504|||1985"
[44] "|Tennessee| 10/1/1996||0.251|||1997"
[45] "|Texas||1/1/1996||1.000|||1996"
[46] "|Utah||5/1/1995||0.671|||1995"
[47] "|Vermont||1970|N/A|1970"
[48] "|Virginia| 5/5/1995||0.660|||1995"
[49] "|Washington||1961|N/A|1961"
[50] "|West Virginia|7/7/1989||0.488|||1990"
[51] "|Wisconsin| 11/1/2011||0.167|||2012"
[52] "|Wyoming||10/1/1994||0.252|||1995"
rownames(p_62) <- c()
colnames(p_62) <- c("STATE",
"E_Date_RTC",
"Frac_Yr_Eff_Yr_Pass",
"RTC_Date_SA")
sapply(p_62, class)
STATE E_Date_RTC Frac_Yr_Eff_Yr_Pass RTC_Date_SA
"character" "character" "character" "character"
p_62 <- p_62 %>%
dplyr::select(STATE, RTC_Date_SA) %>%
rename("RTC_LAW_YEAR"=RTC_Date_SA) %>%
mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) %>%
mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
TRUE ~ RTC_LAW_YEAR))
sapply(p_62, class)
STATE RTC_LAW_YEAR
"character" "numeric"
STATE RTC_LAW_YEAR
1 Alabama 1975
2 Alaska 1995
3 Arizona 1995
4 Arkansas 1996
5 California Inf
6 Colorado 2003
[1] "YEAR" "STATE" "VARIABLE" "VALUE"
[1] "YEAR" "STATE" "VARIABLE" "VALUE"
[1] "STATE" "YEAR" "VALUE" "VARIABLE"
[1] "YEAR" "STATE" "VALUE" "VARIABLE"
[1] "YEAR" "VALUE" "STATE" "VARIABLE"
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1990 Alabama Black_Male_15_to_19_years 1.23
2 1990 Alabama Black_Male_20_to_39_years 3.62
3 1990 Alabama Other_Male_15_to_19_years 0.0470
4 1990 Alabama Other_Male_20_to_39_years 0.176
5 1990 Alabama White_Male_15_to_19_years 2.70
6 1990 Alabama White_Male_20_to_39_years 11.6
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1990 Alabama Black_Female_10_to_19_years 2.45
2 1990 Alabama Black_Female_20_to_29_years 2.20
3 1990 Alabama Black_Female_30_to_39_years 2.18
4 1990 Alabama Black_Female_40_to_49_years 1.37
5 1990 Alabama Black_Female_50_to_64_years 1.53
6 1990 Alabama Black_Female_65_years_and_over 1.67
# A tibble: 6 x 4
STATE YEAR VALUE VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 1977 7.3 Unemployment_rate
2 Alabama 1978 6.4 Unemployment_rate
3 Alabama 1979 7.2 Unemployment_rate
4 Alabama 1980 8.9 Unemployment_rate
5 Alabama 1981 10.6 Unemployment_rate
6 Alabama 1982 14.1 Unemployment_rate
# A tibble: 6 x 4
YEAR STATE VALUE VARIABLE
<dbl> <chr> <dbl> <chr>
1 2018 Alabama 16 Poverty_rate
2 2018 Alaska 13.1 Poverty_rate
3 2018 Arizona 12.8 Poverty_rate
4 2018 Arkansas 15.9 Poverty_rate
5 2018 California 11.9 Poverty_rate
6 2018 Colorado 9.1 Poverty_rate
# A tibble: 6 x 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 15293 Alabama Viol_crime_count
2 1978 15682 Alabama Viol_crime_count
3 1979 15578 Alabama Viol_crime_count
4 1980 17320 Alabama Viol_crime_count
5 1981 18423 Alabama Viol_crime_count
6 1982 17653 Alabama Viol_crime_count
DONOHUE_DF <- bind_rows(dem_DONOHUE,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(p_62 , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
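The `Inf` sentinel assigned earlier makes this treatment indicator work without special-casing states that never adopted a law: any finite `YEAR` compares `FALSE` against `Inf`. A base-R sketch with toy values (not the real panel):

```r
# Toy illustration of the Inf sentinel for "never adopted RTC":
toy <- data.frame(YEAR         = c(1990, 2000, 1990, 2000),
                  RTC_LAW_YEAR = c(1995, 1995, Inf,  Inf))
# Finite YEAR >= Inf is always FALSE, so non-adopting states
# are never flagged as treated.
toy$RTC_LAW <- toy$YEAR >= toy$RTC_LAW_YEAR
toy$RTC_LAW
# FALSE  TRUE FALSE FALSE
```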
DONOHUE_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])
# A tibble: 33 x 2
YEAR n
<dbl> <int>
1 1980 52
2 1981 52
3 1982 52
4 1983 52
5 1984 52
6 1985 52
7 1986 52
8 1987 52
9 1988 52
10 1989 52
11 1990 52
12 1991 52
13 1992 52
14 1993 52
15 1994 52
16 1995 52
17 1996 52
18 1997 52
19 1998 52
20 1999 52
21 2000 52
22 2001 52
23 2002 52
24 2003 52
25 2004 52
26 2005 52
27 2006 52
28 2007 52
29 2008 52
30 2009 52
31 2010 52
32 2011 52
33 2012 52
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut D.C. Delaware
44 33 44
District of Columbia Florida Georgia
44 44 44
Hawaii Idaho Illinois
44 44 44
Indiana Iowa Kansas
44 44 44
Kentucky Louisiana Maine
44 44 44
Maryland Massachusetts Michigan
44 44 44
Minnesota Mississippi Missouri
44 44 44
Montana Nebraska Nevada
44 44 44
New Hampshire New Jersey New Mexico
44 44 44
New York North Carolina North Dakota
44 44 44
Ohio Oklahoma Oregon
44 44 44
Pennsylvania Rhode Island South Carolina
44 44 44
South Dakota Tennessee Texas
44 44 44
Utah Vermont Virginia
44 44 44
Washington West Virginia Wisconsin
44 44 44
Wyoming
44
[1] 44
DONOHUE_DF <- DONOHUE_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(DONOHUE_DF$STATE))
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut District of Columbia Delaware
44 77 44
Florida Georgia Hawaii
44 44 44
Idaho Illinois Indiana
44 44 44
Iowa Kansas Kentucky
44 44 44
Louisiana Maine Maryland
44 44 44
Massachusetts Michigan Minnesota
44 44 44
Mississippi Missouri Montana
44 44 44
Nebraska Nevada New Hampshire
44 44 44
New Jersey New Mexico New York
44 44 44
North Carolina North Dakota Ohio
44 44 44
Oklahoma Oregon Pennsylvania
44 44 44
Rhode Island South Carolina South Dakota
44 44 44
Tennessee Texas Utah
44 44 44
Vermont Virginia Washington
44 44 44
West Virginia Wisconsin Wyoming
44 44 44
[1] 51
DONOHUE_DF <- DONOHUE_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and gets rid of units that have incomplete data. This returns a dataframe with the most complete information.
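The coalescing step rests on `na.omit(unique(.))`: within each STATE-YEAR group, rows stacked from different source datasets each supply one observed value per column plus `NA`s, and the composition collapses them to the single observed value. A toy single-column example (not the real data):

```r
# One state-year column stacked from three sources: two sources did not
# supply this variable (NA), one did.
x <- c(NA, 16, NA)
as.numeric(na.omit(unique(x)))
# 16
# Note: if two sources disagreed, more than one value would survive and
# the summarise step would fail, so this doubles as a consistency check.
```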
summary(as.factor(DONOHUE_DF$STATE))
Alabama Alaska Arizona
21 21 21
Arkansas California Colorado
21 21 21
Connecticut District of Columbia Delaware
21 21 21
Florida Georgia Hawaii
21 21 21
Idaho Illinois Indiana
21 21 21
Iowa Kansas Kentucky
21 21 21
Louisiana Maine Maryland
21 21 21
Massachusetts Michigan Minnesota
21 21 21
Mississippi Missouri Montana
21 21 21
Nebraska Nevada New Hampshire
21 21 21
New Jersey New Mexico New York
21 21 21
North Carolina North Dakota Ohio
21 21 21
Oklahoma Oregon Pennsylvania
21 21 21
Rhode Island South Carolina South Dakota
21 21 21
Tennessee Texas Utah
21 21 21
Vermont Virginia Washington
21 21 21
West Virginia Wisconsin Wyoming
21 21 21
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
DONOHUE_DF <- DONOHUE_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
DONOHUE_DF <- DONOHUE_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(DONOHUE_DF$STATE)))
Alaska Arizona Arkansas
21 21 21
California Colorado District of Columbia
21 21 21
Delaware Hawaii Illinois
21 21 21
Iowa Kansas Kentucky
21 21 21
Louisiana Maryland Massachusetts
21 21 21
Michigan Minnesota Missouri
21 21 21
Montana Nebraska Nevada
21 21 21
New Jersey New Mexico New York
21 21 21
North Carolina Ohio Oklahoma
21 21 21
Rhode Island South Carolina Tennessee
21 21 21
Texas Utah Virginia
21 21 21
Wisconsin Wyoming
21 21
[1] 35
LOTT_DF <- bind_rows(dem_LOTT,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(p_62 , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
LOTT_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])
# A tibble: 33 x 2
YEAR n
<dbl> <int>
1 1980 52
2 1981 52
3 1982 52
4 1983 52
5 1984 52
6 1985 52
7 1986 52
8 1987 52
9 1988 52
10 1989 52
11 1990 52
12 1991 52
13 1992 52
14 1993 52
15 1994 52
16 1995 52
17 1996 52
18 1997 52
19 1998 52
20 1999 52
21 2000 52
22 2001 52
23 2002 52
24 2003 52
25 2004 52
26 2005 52
27 2006 52
28 2007 52
29 2008 52
30 2009 52
31 2010 52
32 2011 52
33 2012 52
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut D.C. Delaware
44 33 44
District of Columbia Florida Georgia
44 44 44
Hawaii Idaho Illinois
44 44 44
Indiana Iowa Kansas
44 44 44
Kentucky Louisiana Maine
44 44 44
Maryland Massachusetts Michigan
44 44 44
Minnesota Mississippi Missouri
44 44 44
Montana Nebraska Nevada
44 44 44
New Hampshire New Jersey New Mexico
44 44 44
New York North Carolina North Dakota
44 44 44
Ohio Oklahoma Oregon
44 44 44
Pennsylvania Rhode Island South Carolina
44 44 44
South Dakota Tennessee Texas
44 44 44
Utah Vermont Virginia
44 44 44
Washington West Virginia Wisconsin
44 44 44
Wyoming
44
[1] 44
LOTT_DF <- LOTT_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(LOTT_DF$STATE))
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut District of Columbia Delaware
44 77 44
Florida Georgia Hawaii
44 44 44
Idaho Illinois Indiana
44 44 44
Iowa Kansas Kentucky
44 44 44
Louisiana Maine Maryland
44 44 44
Massachusetts Michigan Minnesota
44 44 44
Mississippi Missouri Montana
44 44 44
Nebraska Nevada New Hampshire
44 44 44
New Jersey New Mexico New York
44 44 44
North Carolina North Dakota Ohio
44 44 44
Oklahoma Oregon Pennsylvania
44 44 44
Rhode Island South Carolina South Dakota
44 44 44
Tennessee Texas Utah
44 44 44
Vermont Virginia Washington
44 44 44
West Virginia Wisconsin Wyoming
44 44 44
[1] 51
LOTT_DF <- LOTT_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and gets rid of units that have incomplete data. This returns a dataframe with the most complete information.
summary(as.factor(LOTT_DF$STATE))
Alabama Alaska Arizona
21 21 21
Arkansas California Colorado
21 21 21
Connecticut District of Columbia Delaware
21 21 21
Florida Georgia Hawaii
21 21 21
Idaho Illinois Indiana
21 21 21
Iowa Kansas Kentucky
21 21 21
Louisiana Maine Maryland
21 21 21
Massachusetts Michigan Minnesota
21 21 21
Mississippi Missouri Montana
21 21 21
Nebraska Nevada New Hampshire
21 21 21
New Jersey New Mexico New York
21 21 21
North Carolina North Dakota Ohio
21 21 21
Oklahoma Oregon Pennsylvania
21 21 21
Rhode Island South Carolina South Dakota
21 21 21
Tennessee Texas Utah
21 21 21
Vermont Virginia Washington
21 21 21
West Virginia Wisconsin Wyoming
21 21 21
baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
LOTT_DF <- LOTT_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
LOTT_DF <- LOTT_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(LOTT_DF$STATE)))
Alaska Arizona Arkansas
21 21 21
California Colorado District of Columbia
21 21 21
Delaware Hawaii Illinois
21 21 21
Iowa Kansas Kentucky
21 21 21
Louisiana Maryland Massachusetts
21 21 21
Michigan Minnesota Missouri
21 21 21
Montana Nebraska Nevada
21 21 21
New Jersey New Mexico New York
21 21 21
North Carolina Ohio Oklahoma
21 21 21
Rhode Island South Carolina Tennessee
21 21 21
Texas Utah Virginia
21 21 21
Wisconsin Wyoming
21 21
[1] 35
STATE YEAR Black_Male_15_to_19_years
"factor" "numeric" "numeric"
Black_Male_20_to_39_years Other_Male_15_to_19_years Other_Male_20_to_39_years
"numeric" "numeric" "numeric"
White_Male_15_to_19_years White_Male_20_to_39_years Unemployment_rate
"numeric" "numeric" "numeric"
Poverty_rate Viol_crime_count Population
"numeric" "numeric" "numeric"
police_per_100k_lag RTC_LAW_YEAR RTC_LAW
"numeric" "numeric" "logical"
TIME_0 TIME_INF Viol_crime_rate_1k
"numeric" "numeric" "numeric"
Viol_crime_rate_1k_log Population_log
"numeric" "numeric"
DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log, color = STATE)) +
geom_point(size = 0.5) +
geom_line(aes(group=STATE),
size = 0.5,
show.legend = FALSE) +
geom_text_repel(data = DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
filter(YEAR == last(YEAR)),
aes(label = STATE,
x = YEAR,
y = Viol_crime_rate_100k_log),
size = 3,
alpha = 1,
nudge_x = 10,
direction = "y",
hjust = 1,
vjust = 1,
segment.size = 0.25,
segment.alpha = 0.25,
force = 1,
max.iter = 9999) +
guides(color = FALSE) +
scale_x_continuous(breaks = seq(1980, 2015, by = 1),
limits = c(1980, 2015),
labels = c(seq(1980, 2010, by = 1), rep("", 5))) +
scale_y_continuous(breaks = seq(3.5, 8.5, by = 0.5),
limits = c(3.5, 8.5)) +
labs(title = "States have different levels of crime",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))

DONOHUE_DF %>%
group_by(YEAR) %>%
summarise(Viol_crime_count = sum(Viol_crime_count),
Population = sum(Population),
.groups = "drop") %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log)) +
geom_line() +
scale_x_continuous(breaks = seq(1980, 2010, by = 1),
limits = c(1980, 2010),
labels = c(seq(1980, 2010, by = 1))) +
scale_y_continuous(breaks = seq(5.75, 6.75, by = 0.25),
limits = c(5.75, 6.75)) +
labs(title = "Crime rates fluctuate over time",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))

Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
d_panel_DONOHUE <- pdata.frame(DONOHUE_DF, index=c("STATE", "YEAR"))
DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
effect = "twoways",
model = "within",
data=d_panel_DONOHUE)
summary(DONOHUE_OUTPUT)
Twoways effects Within Model
Call:
plm(formula = Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
White_Male_20_to_39_years + Black_Male_15_to_19_years + Black_Male_20_to_39_years +
Other_Male_15_to_19_years + Other_Male_20_to_39_years + Unemployment_rate +
Poverty_rate + Population_log + police_per_100k_lag, data = d_panel_DONOHUE,
effect = "twoways", model = "within")
Balanced Panel: n = 35, T = 21, N = 735
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.6338573 -0.0622728 -0.0032734 0.0600340 0.4747461
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE 0.01399747 0.01807103 0.7746 0.4388610
White_Male_15_to_19_years -0.07163612 0.04256304 -1.6831 0.0928302 .
White_Male_20_to_39_years 0.04735001 0.01368064 3.4611 0.0005722 ***
Black_Male_15_to_19_years 0.14345532 0.08877101 1.6160 0.1065624
Black_Male_20_to_39_years 0.13841529 0.02430620 5.6946 1.852e-08 ***
Other_Male_15_to_19_years 0.85500263 0.11590669 7.3766 4.822e-13 ***
Other_Male_20_to_39_years -0.25713549 0.04257069 -6.0402 2.554e-09 ***
Unemployment_rate 0.00221657 0.00654484 0.3387 0.7349615
Poverty_rate 0.00392491 0.00332665 1.1798 0.2384832
Population_log 0.23504152 0.09612692 2.4451 0.0147374 *
police_per_100k_lag 0.00054079 0.00017321 3.1222 0.0018723 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 12.935
Residual Sum of Squares: 10.268
R-Squared: 0.20621
Adj. R-Squared: 0.12909
F-statistic: 15.7996 on 11 and 669 DF, p-value: < 2.22e-16
Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
LOTT_variables <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames()
LOTT_fmla <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables, collapse = " + ")
)
)
d_panel_LOTT <- pdata.frame(LOTT_DF, index=c("STATE", "YEAR"))
LOTT_OUTPUT <- plm(LOTT_fmla,
model = "within",
effect = "twoways",
data=d_panel_LOTT)
summary(LOTT_OUTPUT)
Twoways effects Within Model
Call:
plm(formula = LOTT_fmla, data = d_panel_LOTT, effect = "twoways",
model = "within")
Balanced Panel: n = 35, T = 21, N = 735
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.46717726 -0.05325358 0.00074259 0.05557220 0.31537243
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE -0.02936965 0.01823854 -1.6103 0.1078246
White_Female_10_to_19_years 0.65921612 0.20489684 3.2173 0.0013594 **
White_Female_20_to_29_years 0.08644467 0.07612062 1.1356 0.2565382
White_Female_30_to_39_years -0.13836498 0.10773830 -1.2843 0.1995132
White_Female_40_to_49_years 0.47203824 0.10700674 4.4113 1.206e-05 ***
White_Female_50_to_64_years -0.43678487 0.09390272 -4.6515 4.007e-06 ***
White_Female_65_years_and_over -0.28217891 0.08930520 -3.1597 0.0016538 **
White_Male_10_to_19_years -0.58796604 0.19526018 -3.0112 0.0027049 **
White_Male_20_to_29_years 0.02921390 0.07057479 0.4139 0.6790551
White_Male_30_to_39_years 0.07899878 0.10478693 0.7539 0.4511875
White_Male_40_to_49_years -0.44302519 0.09301896 -4.7627 2.365e-06 ***
White_Male_50_to_64_years 0.35259054 0.09757265 3.6136 0.0003257 ***
White_Male_65_years_and_over 0.46833455 0.13366773 3.5037 0.0004908 ***
Black_Female_10_to_19_years -0.46902450 0.43012668 -1.0904 0.2759333
Black_Female_20_to_29_years -0.85330286 0.19903094 -4.2873 2.088e-05 ***
Black_Female_30_to_39_years 0.67000107 0.29117489 2.3010 0.0217112 *
Black_Female_40_to_49_years 0.20211428 0.24593810 0.8218 0.4114918
Black_Female_50_to_64_years 0.27010596 0.24815423 1.0885 0.2768024
Black_Female_65_years_and_over -0.12661080 0.36916618 -0.3430 0.7317382
Black_Male_10_to_19_years 0.74235062 0.44292888 1.6760 0.0942266 .
Black_Male_20_to_29_years 0.79460625 0.21556244 3.6862 0.0002470 ***
Black_Male_30_to_39_years -0.29673477 0.30748553 -0.9650 0.3348916
Black_Male_40_to_49_years -0.14562558 0.30662465 -0.4749 0.6349984
Black_Male_50_to_64_years -0.14915192 0.27623210 -0.5400 0.5894186
Black_Male_65_years_and_over -0.96725742 0.58296284 -1.6592 0.0975643 .
Other_Female_10_to_19_years -2.24967607 0.57904598 -3.8851 0.0001129 ***
Other_Female_20_to_29_years -0.27537653 0.30109748 -0.9146 0.3607592
Other_Female_30_to_39_years -2.45674942 0.40420766 -6.0779 2.093e-09 ***
Other_Female_40_to_49_years 1.44842440 0.52993251 2.7332 0.0064453 **
Other_Female_50_to_64_years 1.61920629 0.40211618 4.0267 6.336e-05 ***
Other_Female_65_years_and_over 1.40148342 0.29971128 4.6761 3.569e-06 ***
Other_Male_10_to_19_years 2.48313426 0.54927918 4.5207 7.345e-06 ***
Other_Male_20_to_29_years 0.25699446 0.29912664 0.8591 0.3905803
Other_Male_30_to_39_years 2.49248478 0.43813341 5.6889 1.948e-08 ***
Other_Male_40_to_49_years -1.53468291 0.51566283 -2.9761 0.0030294 **
Other_Male_50_to_64_years -2.01173714 0.48027699 -4.1887 3.200e-05 ***
Other_Male_65_years_and_over -2.05964386 0.42138889 -4.8878 1.291e-06 ***
Unemployment_rate -0.00605069 0.00626903 -0.9652 0.3348243
Poverty_rate 0.00210437 0.00285683 0.7366 0.4616286
Population_log 0.40657802 0.14084185 2.8868 0.0040237 **
police_per_100k_lag 0.00021813 0.00016227 1.3443 0.1793315
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 12.935
Residual Sum of Squares: 6.4494
R-Squared: 0.5014
Adj. R-Squared: 0.42727
F-statistic: 15.6726 on 41 and 639 DF, p-value: < 2.22e-16
comparing_analyses <- DONOHUE_OUTPUT_TIDY %>%
bind_rows(LOTT_OUTPUT_TIDY) %>%
filter(term == "RTC_LAWTRUE")
library(grid)
comparing_analyses_plot <- ggplot(comparing_analyses) +
geom_point(aes(x = Analysis, y = estimate)) +
geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
geom_hline(yintercept = 0, color = "red") +
scale_y_continuous(breaks = seq(-0.2, 0.2, by = 0.05),
labels = seq(-0.2, 0.2, by = 0.05),
limits = c(-0.2,0.2)) +
geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "green",
lineend = "butt",
linejoin = "mitre") +
geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "red",
lineend = "butt",
linejoin = "mitre") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text = element_text(size = 12)) +
labs(title = "Effect estimate on ln(violent crimes per 100,000 people)",
y = "Effect estimate (95% CI)")
comparing_analyses_plot

How did the above happen?
The analysis dataframes are very similar, yet they yielded very different results.
- different numbers of columns: 20 vs. 50
[1] TRUE
The only difference between the two dataframes lies in how the demographic variables were parameterized.
[1] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[3] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[5] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[1] "Black_Female_10_to_19_years" "Black_Female_20_to_29_years"
[3] "Black_Female_30_to_39_years" "Black_Female_40_to_49_years"
[5] "Black_Female_50_to_64_years" "Black_Female_65_years_and_over"
[7] "Black_Male_10_to_19_years" "Black_Male_20_to_29_years"
[9] "Black_Male_30_to_39_years" "Black_Male_40_to_49_years"
[11] "Black_Male_50_to_64_years" "Black_Male_65_years_and_over"
[13] "Other_Female_10_to_19_years" "Other_Female_20_to_29_years"
[15] "Other_Female_30_to_39_years" "Other_Female_40_to_49_years"
[17] "Other_Female_50_to_64_years" "Other_Female_65_years_and_over"
[19] "Other_Male_10_to_19_years" "Other_Male_20_to_29_years"
[21] "Other_Male_30_to_39_years" "Other_Male_40_to_49_years"
[23] "Other_Male_50_to_64_years" "Other_Male_65_years_and_over"
[25] "White_Female_10_to_19_years" "White_Female_20_to_29_years"
[27] "White_Female_30_to_39_years" "White_Female_40_to_49_years"
[29] "White_Female_50_to_64_years" "White_Female_65_years_and_over"
[31] "White_Male_10_to_19_years" "White_Male_20_to_29_years"
[33] "White_Male_30_to_39_years" "White_Male_40_to_49_years"
[35] "White_Male_50_to_64_years" "White_Male_65_years_and_over"
Clearly, this had an effect on the results of the analysis.
Let’s explore how this occurred.
When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted.
In regression analysis, this distortion is often a byproduct of violating the assumption that the predictor variables are independent of one another. If large enough, the distortion can compromise statistical inference: coefficient estimates become unstable and their standard errors inflate.
There are several ways we can diagnose multicollinearity.
Again, multicollinearity often occurs when independent variables are highly related to one another. Consequently, we can evaluate these relationships by examining the correlation between variable pairs.
It is important to note that multicollinearity and correlation are not one and the same. Correlation can be thought of as the strength of the relationship between variables. Multicollinearity, on the other hand, can be thought of as the violation of the independence assumption that this correlation produces in a regression analysis.
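A minimal simulation (an assumed toy setup, not the case-study data) makes the distinction concrete: nearly duplicated predictors sharply inflate each other's standard errors, and the inflation is captured by the variance inflation factor (VIF) used later in this case study.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is nearly a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)     # y truly depends on x1 only

# Standard error of x1's coefficient with and without its near-duplicate:
se_collinear <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_alone     <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]

# VIF for x1: 1 / (1 - R^2) from regressing x1 on the other predictor(s).
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
```

Here `se_collinear` is roughly `sqrt(vif_x1)` times `se_alone`: the two predictors are individually noisy even though jointly the model fits well, which is exactly the pattern we will look for in the two analyses below.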
[1] "STATE" "YEAR"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[9] "Unemployment_rate" "Poverty_rate"
[11] "Viol_crime_count" "Population"
[13] "police_per_100k_lag" "RTC_LAW_YEAR"
[15] "RTC_LAW" "TIME_0"
[17] "TIME_INF" "Viol_crime_rate_1k"
[19] "Viol_crime_rate_1k_log" "Population_log"
DONOHUE_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))

LOTT_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))

cor_DONOHUE_dem <- cor(DONOHUE_DF %>% dplyr::select(contains("_years")))
corr_mat_DONOHUE <- ggcorrplot(cor_DONOHUE_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 1",
legend.title = TeX("$\\rho$"))
corr_mat_DONOHUE

cor_LOTT_dem <- cor(LOTT_DF %>% dplyr::select(contains("_years")))
corr_mat_LOTT <- ggcorrplot(cor_LOTT_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 2",
legend.title = TeX("$\\rho$"))
corr_mat_LOTT

sims <- 250
# DONOHUE
# round(dim(DONOHUE_DF)[1]/2)
samps_DONOHUE <- lapply(rep(dim(DONOHUE_DF)[1]-1, sims),
function(x)DONOHUE_DF[sample(nrow(DONOHUE_DF),
size = x, replace = FALSE),])
fit_nls_on_bootstrap_DONOHUE <- function(split){
plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_DONOHUE <- lapply(samps_DONOHUE, fit_nls_on_bootstrap_DONOHUE)
samps_models_DONOHUE <- samps_models_DONOHUE %>%
map(tidy)
names(samps_models_DONOHUE) <- paste0("DONOHUE_",1:length(samps_models_DONOHUE))
simulations_DONOHUE <- samps_models_DONOHUE %>%
bind_rows(.id = "ID") %>%
mutate(Analysis = "Analysis 1")
## LOTT
samps_LOTT <- lapply(rep(round(dim(LOTT_DF)[1]/2), sims),
function(x) LOTT_DF[sample(nrow(LOTT_DF),
size = x, replace = FALSE),])
fit_nls_on_bootstrap_LOTT <- function(split){
plm(LOTT_fmla,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_LOTT <- lapply(samps_LOTT, fit_nls_on_bootstrap_LOTT)
samps_models_LOTT <- samps_models_LOTT %>%
map(tidy)
names(samps_models_LOTT) <- paste0("LOTT_",1:length(samps_models_LOTT))
simulations_LOTT <- samps_models_LOTT %>%
bind_rows(.id = "ID") %>%
mutate(Analysis = "Analysis 2")
simulations <- bind_rows(simulations_DONOHUE,
simulations_LOTT)
simulation_plot <- simulations %>%
filter(term=="RTC_LAWTRUE") %>%
ggplot(aes(x = Analysis, y = estimate)) +
geom_jitter(alpha = 0.25,
width = 0.1) +
labs(title = "Coefficient instability",
subtitle = "Estimates sensitive to observation deletions",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_minimal() +
theme(axis.title.x = element_blank())
simulation_plot
design.matrix <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_DONOHUE$Viol_crime_rate_1k_log)
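For context on the `plm::Within()` step above: for a balanced panel with `effect = "twoways"`, the within transformation subtracts each unit's mean and each period's mean and adds back the grand mean, which removes any purely additive unit and time effects. A minimal sketch of that transformation in Python (the unit/time labels and values here are hypothetical, not the case-study panel):

```python
import numpy as np

def within_twoways(y, unit, time):
    """Two-way within transformation for a balanced panel:
    y_it - ybar_i. - ybar_.t + ybar_.. (unit mean, time mean, grand mean)."""
    y = np.asarray(y, dtype=float)
    out = y - y.mean()
    for g in (np.asarray(unit), np.asarray(time)):
        for lvl in np.unique(g):
            mask = g == lvl
            out[mask] -= y[mask].mean() - y.mean()
    return out

# A purely additive two-way fixed-effects signal is removed entirely:
units = np.repeat([0, 1, 2], 4)
times = np.tile([0, 1, 2, 3], 3)
y = 10.0 * units + 1.0 * times   # unit effect + time effect, nothing else
# within_twoways(y, units, times) is (numerically) all zeros
```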
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE + # logical class changes the variable name after the initial model
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = design.matrix)
vif(lm_DONOHUE)
RTC_LAWTRUE White_Male_15_to_19_years White_Male_20_to_39_years
1.157336 1.753266 2.608183
Black_Male_15_to_19_years Black_Male_20_to_39_years Other_Male_15_to_19_years
1.332589 2.283867 1.681014
Other_Male_20_to_39_years Unemployment_rate Poverty_rate
1.407883 1.311134 1.198027
Population_log police_per_100k_lag
1.176024 1.117583
vif_DONOHUE <- vif(lm_DONOHUE)
vif_DONOHUE <- vif_DONOHUE %>%
as_tibble() %>%
cbind(., names(vif_DONOHUE)) %>%
as_tibble()
colnames(vif_DONOHUE) <- c("VIF", "Variable")
max_vif_DONOHUE <- max(vif(lm_DONOHUE))
design.matrix <- as.data.frame(model.matrix(LOTT_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_LOTT$Viol_crime_rate_1k_log)
LOTT_variables_ols <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames() %>%
str_replace("RTC_LAW", "RTC_LAWTRUE") # logical class changes the variable name after the initial model
LOTT_fmla_ols <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables_ols, collapse = " + ")
)
)
lm_LOTT <- lm(LOTT_fmla_ols,
data = design.matrix)
vif(lm_LOTT)
RTC_LAWTRUE White_Female_10_to_19_years
1.792652 175.910307
White_Female_20_to_29_years White_Female_30_to_39_years
38.695213 79.741526
White_Female_40_to_49_years White_Female_50_to_64_years
29.102734 41.365488
White_Female_65_years_and_over White_Male_10_to_19_years
22.912619 180.294231
White_Male_20_to_29_years White_Male_30_to_39_years
37.977938 92.987226
White_Male_40_to_49_years White_Male_50_to_64_years
30.083073 55.343121
White_Male_65_years_and_over Black_Female_10_to_19_years
27.351523 140.154114
Black_Female_20_to_29_years Black_Female_30_to_39_years
56.959351 171.914269
Black_Female_40_to_49_years Black_Female_50_to_64_years
46.707169 71.514828
Black_Female_65_years_and_over Black_Male_10_to_19_years
106.967399 131.486665
Black_Male_20_to_29_years Black_Male_30_to_39_years
63.713636 183.271248
Black_Male_40_to_49_years Black_Male_50_to_64_years
43.494574 66.342952
Black_Male_65_years_and_over Other_Female_10_to_19_years
92.090941 184.637460
Other_Female_20_to_29_years Other_Female_30_to_39_years
51.812396 63.776558
Other_Female_40_to_49_years Other_Female_50_to_64_years
141.953648 234.097628
Other_Female_65_years_and_over Other_Male_10_to_19_years
60.354538 212.258459
Other_Male_20_to_29_years Other_Male_30_to_39_years
59.985342 65.956192
Other_Male_40_to_49_years Other_Male_50_to_64_years
108.867459 384.831305
Other_Male_65_years_and_over Unemployment_rate
17.361666 1.829250
Poverty_rate Population_log
1.343519 3.838949
police_per_100k_lag
1.491501
vif_LOTT <- vif(lm_LOTT)
vif_LOTT <- vif_LOTT %>%
as_tibble() %>%
cbind(., names(vif_LOTT)) %>%
as_tibble()
colnames(vif_LOTT) <- c("VIF", "Variable")
max_vif_LOTT <- max(vif(lm_LOTT))
max_vif_DONOHUE
[1] 2.608183
max_vif_LOTT
[1] 384.8313
Recall that the variance inflation factor for the \(i\)-th predictor is

\[VIF_{i} = \frac{1}{1-R_{i}^{2}}\]

where \(R_{i}^{2}\) is the \(R^{2}\) obtained from regressing the \(i\)-th predictor on all of the remaining predictors.
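This quantity can be computed directly: regress the \(i\)-th predictor on the remaining predictors, take the resulting \(R_{i}^{2}\), and invert \(1-R_{i}^{2}\). A minimal sketch in Python on hypothetical data (not the case-study data) shows the VIF exploding as two predictors approach collinearity while staying near 1 for an independent predictor:

```python
import numpy as np

def vif(X, i):
    """VIF of column i of design matrix X: regress X[:, i] on the remaining
    columns (plus an intercept) and return 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.99 * x1 + np.sqrt(1 - 0.99**2) * rng.normal(size=n)  # corr ~ 0.99
x3 = rng.normal(size=n)                                     # independent
X = np.column_stack([x1, x2, x3])
# vif(X, 0) is large; vif(X, 2) stays close to 1
```

With the correlation between the first two columns set near 0.99, the first VIF lands around \(1/(1-0.99^{2}) \approx 50\), comparable in magnitude to many of the values in the Analysis 2 output above, while the independent third column stays close to 1.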
vif_DONOHUE$Analysis <- "Analysis 1"
vif_LOTT$Analysis <- "Analysis 2"
vif_df <- rbind(vif_DONOHUE,
vif_LOTT)
vif_plot <- vif_df %>%
ggplot(aes(x = Analysis, y = VIF)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
geom_hline(yintercept = 10, color = "red") +
scale_y_continuous(trans = 'log10',
limits = c(1,1000)) +
labs(title = "Variance inflation factors") +
theme_minimal() +
theme(axis.title.x = element_blank())
vif_plot
Additional resources on fixed-effects panel models in R:
https://rpubs.com/rslbliss/fixed_effects
http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
https://stats.stackexchange.com/questions/99236/effects-in-panel-models-individual-time-or-twoways